i

Copyright © 2013 Dr. Martin Jones
This work is licensed under a Creative Commons Attribution-NonCommercial-
ShareAlike 3.0 Unported License.
For more information, visit http://pythonforbioogists.!om
"et in #$ "erif an% Source Code Pro
ii
About the author
Martin starte% his programming !areer by earning #er %&ring the !o&rse of
his #hD in evo&tionary bioogy, an% starte% tea!hing other peope to
program soon after. "in!e then he has ta&ght intro%&!tory programming to
h&n%re%s of bioogists, from &n%ergra%&ates to #'s, an% has maintaine% a
phiosophy that programming !o&rses m&st be frien%y, approa!habe, an%
pra!ti!a.
Martin has ta&ght intro%&!tory programming as part of the (ioinformati!s
M"! !o&rse at )%inb&rgh *niversity for the past five years, an% is !&rrenty
+e!t&rer in (ioinformati!s.
iii
Preface
,e!ome to #ython for (ioogists.
(efore yo& rea% any f&rther, ma-e s&re that this is the most re!ent version of
the boo-. #ython for (ioogists is being !ontin&ay &p%ate% an% improve% to
ta-e into a!!o&nt !orre!tions, amen%ments an% !hanges to #ython itsef, so
it.s important that yo& are rea%ing the most &p/to/%ate version.
$his fie is revision n&mber 200. $he n&mber of the most re!ent revision !an
a0ays be fo&n% at:
http://pythonforbioogists.!om/in%e1.php/version/
'f the revision n&mber iste% at the *2+ is higher than the one in bo%, then
this is an o&t/of/%ate !opy, an% yo& nee% to %o0noa% the atest version from
http://pythonforbioogists.!om
3o&. noti!e from the !opyright page that the !ontents of this boo- are
i!ense% &n%er a Creative Commons 4ttrib&tion "hare4i-e i!ense. $his
means that yo&.re free to %o 0hat yo& i-e 0ith it 5 !opy it, emai it to yo&r
frien%s, 0apaper yo&r ab 0ith it 5 as ong as yo& -eep the attrib&tion. 3o&
!an aso mo%ify it, as ong as yo& i!ense yo&r mo%ifi!ation &n%er the same
terms. $he ony thing that the i!ense %oesn.t ao0 is !ommer!ia &se 5 if
yo&.% i-e to &se the !ontents of this !o&rse for !ommer!ia p&rposes, get in
to&!h 0ith me at
martin6pythonforbioogists.!om
7appy programming8
iv
Table of Contents
About the author » ii
Preface » iii
1: Introduction and environment 1
Why have a programming book for biologists? » 1
Why Python? » 2
How to use this book » 5
!ercises an" solutions » #
$etting in touch » %
&etting up your environment » %
'e!t e"itors » 11
(ea"ing the "ocumentation » 12
2: Printing and manipulating text 13
Why are we so intereste" in working with te!t? » 1)
Printing a message to the screen » 1*
+uotes are important » 15
,se comments to annotate your co"e » 1-
rror messages an" "ebugging » 1%
Printing special characters » 21
&toring strings in variables » 21
'ools for manipulating strings » 2*
(ecap » )*
!ercises » )-
&olutions » ).
3: Reading and writing files 2
Why are we so intereste" in working with files? » 52
(ea"ing te!t from a file » 5)
/iles0 contents an" file names » 55
1ealing with newlines » 5#
2issing files » -3
v
Writing te!t to files » -3
4losing files » -)
Paths an" fol"ers » -)
(ecap » -*
!ercises » -5
&olutions » -#
!: "ists and loops #!
Why "o we nee" lists an" loops? » #*
4reating lists an" retrieving elements » #-
Working with list elements » ##
Writing a loop » #.
5n"entation errors » %2
,sing a string as a list » %)
&plitting a string to make a list » %*
5terating over lines in a file » %*
6ooping with ranges » %5
(ecap » %#
!ercises » %.
&olutions » .3
: $riting our own functions %%
Why "o we want to write our own functions? » ..
1efining a function » 133
4alling an" improving our function » 13)
ncapsulation with functions » 135
/unctions "on7t always have to take an argument » 13-
/unctions "on7t always have to return a value » 13%
/unctions can be calle" with name" arguments » 13%
/unction arguments can have "efaults » 113
'esting functions » 111
(ecap » 11)
!ercises » 115
&olutions » 11-
vi
&: Conditional tests 121
Programs nee" to make "ecisions » 121
4on"itions0 'rue an" /alse » 121
if statements » 12*
else statements » 125
elif statements » 12-
while loops » 12%
8uil"ing up comple! con"itions » 12%
Writing true9false functions » 1)3
(ecap » 1)1
!ercises » 1))
&olutions » 1)5
#: Regular expressions 1!1
'he importance of patterns in biology » 1*1
2o"ules in Python » 1*)
(aw strings » 1**
&earching for a pattern in a string » 1*5
!tracting the part of the string that matche" » 153
$etting the position of a match » 152
&plitting a string using a regular e!pression » 15)
/in"ing multiple matches » 15*
(ecap » 155
!ercises » 15#
&olutions » 15%
': (ictionaries 1&'
&toring paire" "ata » 1-%
4reating a "ictionary » 1#)
5terating over a "ictionary » 1#.
(ecap » 1%2
!ercises » 1%)
&olutions » 1%*
%: )iles* programs* and user input 1%
vii
/ile contents an" manipulation » 1.5
8asic file manipulation » 1.-
1eleting files an" fol"ers » 1.%
6isting fol"er contents » 1.%
(unning e!ternal programs » 1..
(unning a program » 233
&aving program output » 231
,ser input makes our programs more fle!ible » 231
5nteractive user input » 23)
4omman" line arguments » 23*
(ecap » 235
!ercises » 23#
&olutions » 23%
1 Chapter 1: 'ntro%&!tion an% environment
1: Introduction and environment
Why have a programming book for biologists?
'f yo&.re rea%ing this boo-, then yo& probaby %on.t nee% to be !onvin!e% that
programming is be!oming an in!reasingy essentia part of the too -it for
bioogists of a types. 3o& might, ho0ever, nee% to be !onvin!e% that a boo- i-e
this one, %eveope% espe!iay for bioogists, !an %o a better 9ob of tea!hing yo& to
program than a genera/p&rpose intro%&!tory programming boo-. 7ere are a fe0 of
the reasons 0hy ' thin- that is the !ase.
4 bioogy/spe!ifi! programming boo- ao0s &s to &se e1ampes an% e1er!ises that
&se bioogi!a probems. $his serves t0o important p&rposes: firsty, it provi%es
motivation an% %emonstrates the types of probems that programming !an hep to
sove. )1perien!e has sho0n that beginners ma-e m&!h better progress 0hen they
are motivate% by the tho&ght of ho0 the programs they 0rite 0i ma-e their ife
easier8 "e!on%y, by &sing bioogi!a e1ampes, the !o%e an% e1er!ises thro&gho&t
the boo- !an form a ibrary of &sef& !o%e snippets, 0hi!h 0e !an refer ba!- to
0hen 0e 0ant to sove rea/ife probems. 'n bioogy, as in a fie%s of
programming, the same probems ten% to re!&r time an% time again, so it.s very
&sef& to have this !oe!tion of e1ampes to a!t as a referen!e 5 something that.s
not possibe 0ith a genera/p&rpose programming boo-.
4 bioogy/spe!ifi! programming boo- !an aso !on!entrate on the feat&res of the
ang&age that are most &sef& to bioogists. 4 ang&age i-e #ython has many
feat&res an% in the !o&rse of earning it 0e inevitaby have to !on!entrate on some
an% miss others o&t. $he set of feat&res 0hi!h are important to &s in bioogy are
sighty %ifferent to those 0hi!h are most &sef& for genera/p&rpose programming
5 for e1ampe, 0e are m&!h more intereste% in manip&ating te1t :in!&%ing things
i-e D;4 an% protein se<&en!es= than the average programmer. 4so, there are
severa feat&res of #ython that 0o&% not normay be %is!&sse% in an intro%&!tory
2 Chapter 1: 'ntro%&!tion an% environment
programming boo-, b&t 0hi!h are very &sef& to bioogists :for e1ampe, reg&ar
e1pressions an% s&bpro!esses=. 7aving a bioogy/spe!ifi! te1tboo- ao0s &s to
in!&%e these feat&res, aong 0ith e1panations of 0hy they are parti!&ary &sef&
to &s.
4 reate% point is that a te1tboo- 0ritten 9&st for bioogists ao0s &s to intro%&!e
feat&res in a 0ay that ao0s &s to start 0riting &sef& programs right a0ay. ,e !an
%o this by ta-ing into a!!o&nt the sorts of probems that repeate%y !rop &p in
bioogy, an% prioritising the feat&res that are best at soving them. $his boo- has
been %esigne% so that yo& sho&% be abe to start 0riting sma b&t &sef& programs
&sing ony the toos in the first !o&pe of !hapters.
Why Python?
+et me start this se!tion 0ith the foo0ing statement: programming ang&ages are
overrate%. ,hat ' mean by that is that peope 0ho are ne0 to programming ten% to
0orry far too m&!h abo&t 0hat ang&age to earn. $he !hoi!e of programming
ang&age %oes matter, of !o&rse, b&t it matters far ess than peope thin- it %oes. $o
p&t it another 0ay, !hoosing the >0rong> programming ang&age is very &ni-ey to
mean the %ifferen!e bet0een fai&re an% s&!!ess 0hen earning. ?ther fa!tors
:motivation, having time to %evote to earning, hepf& !oeag&es= are far more
important, yet re!eive ess attention.
$he reason that peope pa!e so m&!h 0eight on the :what language shoul" 5 learn?:
<&estion is that it.s a big, obvio&s <&estion, an% it.s not %iffi!&t to fin% peope 0ho
0i give yo& strong opinions on the s&b9e!t. 't.s aso the first big <&estion that
beginners have to ans0er on!e they.ve %e!i%e% to earn programming, so it ass&mes
a great %ea of importan!e in their min%s.
$here are three main reasons 0hy !hoi!e of programming ang&age is not as
important as most peope thin- it is. Firsty, neary everybo%y 0ho spen%s any
signifi!ant amo&nt of time programming as part of their 9ob 0i event&ay en% &p
3 Chapter 1: 'ntro%&!tion an% environment
&sing m&tipe ang&ages. #arty this is 9&st %o0n to the simpe !onstraints of
vario&s ang&ages 5 if yo& 0ant to 0rite a 0eb appi!ation yo&. probaby %o it in
Javas!ript, if yo& 0ant to 0rite a graphi!a &ser interfa!e yo&. probaby &se
something i-e Java, an% if yo& 0ant to 0rite o0/eve agorithms yo&. probaby
&se C.
"e!on%y, earning a first programming ang&age gets yo& @0A of the 0ay to0ar%s
earning a se!on%, thir%, an% fo&rth one. +earning to thin- i-e a programmer in the
0ay that yo& brea- %o0n !ompe1 tas-s into simpe ones is a s-i that !&ts a!ross
a ang&ages 5 so if yo& spen% a fe0 months earning #ython an% then %is!over
that yo& reay nee% to 0rite in C, yo&r time 0on.t have been 0aste% as yo&. be
abe to pi!- it &p m&!h <&i!-er.
$hir%y, the -in%s of probems that 0e 0ant to sove in bioogy are generay
amenabe to being sove% in any ang&age, even tho&gh %ifferent programming
ang&ages are goo% at %ifferent things. 'n other 0or%s, as a beginner, yo&r !hoi!e of
ang&age is vanishingy &ni-ey to prevent yo& from soving the probems that yo&
nee% to sove.
7aving sai% a that, 0hen earning to program 0e "o nee% to pi!- a ang&age to
0or- in, so 0e might as 0e pi!- one that.s going to ma-e the 9ob easier. #ython is
s&!h a ang&age for a n&mber of reasons:
• 't has a mosty/!onsistent synta1, so yo& !an generay earn one 0ay of
%oing things an% then appy it in m&tipe pa!es
• 't has a sensibe set of b&it/in ibraries for %oing ots of !ommon tas-s
• 't is %esigne% in s&!h a 0ay that there.s an obvio&s 0ay of %oing most things
• 't.s one of the most 0i%ey/&se% ang&ages in the 0or%, an% there.s a ot of
a%vi!e, %o!&mentation an% t&torias avaiabe on the 0eb
• 't.s %esigne% in a 0ay that ets yo& start to 0rite &sef& programs as soon as
possibe
B Chapter 1: 'ntro%&!tion an% environment
• 'ts &se of in%entation, 0hie annoying to peope 0ho aren.t &se% to it, is
great for beginners as it enfor!es a !ertain amo&nt of rea%abiity
#ython aso has a !o&pe of points to re!ommen% it to bioogists an% s!ientists
spe!ifi!ay:
• 't.s 0i%ey &se% in the s!ientifi! !omm&nity
• 't has a !o&pe of very 0e/%esigne% ibraries for %oing !ompe1 s!ientifi!
!omp&ting :atho&gh 0e 0on.t en!o&nter them in this boo-=
• 't en% itsef 0e to being integrate% 0ith other, e1isting toos
• 't has feat&res 0hi!h ma-e it easy to manip&ate strings of !hara!ters :for
e1ampe, strings of D;4 bases an% protein amino a!i% resi%&es, 0hi!h 0e as
bioogists are parti!&ary fon% of=
P+t,on vs- Perl
For bioogists, the <&estion :what language shoul" 5 learn: often reay !omes %o0n
to the <&estion :shoul" 5 learn Perl or Python?:, so et.s ans0er it hea% on. #er an%
#ython are both perfe!ty goo% ang&ages for soving a 0i%e variety of bioogi!a
probems. 7o0ever, after e1tensive e1perien!e tea!hing both #er an% #ython to
bioogists, '.ve !ome the !on!&sion that #ython is an easier ang&age to earn by
virt&e of being more consistent an% more readable.
4n important thing to &n%erstan% abo&t #er an% #ython is that they are incre"ibly
simiar :%espite the fa!t that they oo- very %ifferent=, so the point above abo&t
earning a se!on% ang&age appies %o&by. Many #ython an% #er feat&res have a
one/to/one !orrespon%en!e, an% so earning #er after earning #ython 0i be
reativey easy 5 m&!h easier than, for e1ampe, moving to Java or C.
C Chapter 1: 'ntro%&!tion an% environment
How to use this book
#rogramming boo-s generay fa into t0o !ategoriesD referen!e/type boo-s, 0hi!h
are %esigne% for oo-ing &p spe!ifi! bits of information, an% t&toria/type boo-s,
0hi!h are %esigne% to be rea% !over/to/!over. $his boo- is an e1ampe of the atter
5 !o%e sampes in ater !hapters often &se materia from previo&s ones, so yo& nee%
to ma-e s&re yo& rea% the !hapters in or%er. )1er!ises or e1ampes from one
!hapter are sometimes &se% to i&strate the nee% for feat&res that are intro%&!e%
in the ne1t.
$here are a n&mber of f&n%amenta programming !on!epts that are reevant to
materia in m&tipe %ifferent !hapters. 'n this boo-, rather than intro%&!e these
!on!epts a in one go, '.ve trie% to e1pain them as they be!ome ne!essary. $his
res&ts in a ten%en!y for earier !hapters to be onger than ater ones, as they
invove the intro%&!tion of more ne0 !on!epts.
4 !ertain amo&nt of 9argon is ne!essary if 0e 0ant to ta- abo&t programs an%
programming !on!epts. '.ve trie% to %efine ea!h ne0 te!hni!a term at the point
0here it.s intro%&!e%, an% then &se it thereafter 0ith o!!asiona remin%ers of the
meaning.
Chapters ten% to foo0 a pre%i!tabe str&!t&re. $hey generay start 0ith a fe0
paragraphs o&tining the motivation behin% the feat&res that it 0i !over 5 0hy %o
they e1ist, 0hat probems %o they ao0 &s to sove, an% 0hy are they &sef& in
bioogy spe!ifi!ayE $hese are foo0e% by the main bo%y of the !hapter in 0hi!h
0e %is!&ss the reevant feat&res an% ho0 to &se them. $he ength of the !hapters
varies <&ite a ot 5 sometimes 0e 0ant to !over a topi! briefy, other times 0e nee%
more %epth. $his se!tion en%s 0ith a brief re!ap o&tining 0hat 0e have earne%,
foo0e% by e1er!ises an% so&tions :more on that topi! beo0=.
F Chapter 1: 'ntro%&!tion an% environment
)urt,er reading
'.ve %eiberatey imite% the s!ope of this boo- to intro%&!tory materia, in or%er to
-eep the siGe manageabe. 4s a res&t, there are ots of &sef& te!hni<&es an% toos
that '.ve ha% to eave o&t. $he goo% st&ff that ' !o&%n.t fit into this boo- forms the
basis of my se!on% boo-, A"vance" Python for 8iologists. 3o& !an rea% more abo&t
4%van!e% #ython for (ioogists at this *2+:
http://pythonforbiologists.com/ap4b
$here are severa toos an% te!hni<&es that are %is!&sse% ony briefy in this boo-,
b&t in m&!h more %epth in A"vance" Python for 8iologists. 2ather than repeating
the *2+ ea!h time, ' have 9&st mentione% the reevant !hapters in the te1t or in
footnotes. 7opef&y this sho&% ao0 yo& to easiy fin% the !orrespon%ing bit in
the a%van!e% boo- 0hen yo& 0ant to rea% abo&t a parti!&ar topi! in more %epth.
/ormatting
4 !o&pe of notes on typography: bold t+pe is &se% to emphasiGe important points
an% italics for te!hni!a terms an% fie names. ,here !o%e is mi1e% in 0ith norma
te1t it.s 0ritten in a mono-spaced font like this. ?!!asionay there are
footnotes
1
to provi%e a%%itiona information that is interesting to -no0 b&t not
!r&!ia to &n%erstan%ing, or to give in-s to 0eb pages.
)1ampe !o%e is highighte% 0ith a soi% bor%er:
Some example code goes here
1 +i-e this.
H Chapter 1: 'ntro%&!tion an% environment
an% e1ampe o&tp&t :i.e. 0hat 0e see on the s!reen 0hen 0e r&n the !o%e= is
highighte% 0ith a %otte% bor%er:
Some output goes here
?ften 0e 0ant to oo- at the !o%e an% the o&tp&t it pro%&!es together. 'n these
sit&ations, yo&. see a soi%/bor%ere% !o%e bo!- foo0e% imme%iatey by a %otte%/
bor%ere% o&tp&t bo!-.
"ometimes it.s ne!essary to refer in the te1t to in%ivi%&a ines of !o%e or o&tp&t, in
0hi!h !ase '.ve &se% ine n&mberings on the eft:
first line
second line
third line
?ther bo!-s of te1t :&s&ay fie !ontents or type% !omman% ines= %on.t have any
-in% of bor%er an% oo- i-e this:
contents of a file
!ercises an" solutions
$he fina part of ea!h !hapter is a set of e1er!ises an% so&tions. $he n&mber an%
!ompe1ity of e1er!ises %iffer greaty bet0een !hapters %epen%ing on the nat&re of
the materia. 4s a r&e, eary !hapters have a arge n&mber of simpe e1er!ises,
0hie ater !hapters have a sma n&mber of more !ompe1 ones. Many of the
e1er!ise probems are 0ritten in a %eiberatey vag&e manner an% the e1a!t %etais
of ho0 the so&tions 0or- is &p to yo& :very m&!h i-e rea/ife programming8= 3o&
!an a0ays oo- at the so&tions to see one possibe 0ay of ta!-ing the probem,
b&t there are often m&tipe vai% approa!hes.
1
2
3
I Chapter 1: 'ntro%&!tion an% environment
' strongy re!ommen% that yo& try ta!-ing the e1er!ises yo&rsef before rea%ing
the so&tionsD there reay is no s&bstit&te for pra!ti!a e1perien!e 0hen earning to
program. ' aso en!o&rage yo& to a%opt an attit&%e of !&rio&s e1perimentation
0hen 0or-ing on the e1er!ises 5 if yo& fin% yo&rsef 0on%ering if a parti!&ar
variation on a probem is sovabe, or if yo& re!ogniGe a !osey/reate% probem
from yo&r o0n 0or-, try soving it8 Contin&o&s e1perimentation is a -ey part of
%eveoping as a programmer, an% the <&i!-est 0ay to fin% o&t 0hat a parti!&ar
f&n!tion or feat&re 0i %o is to try it.
$he e1ampe so&tions to e1er!ises are 0ritten in a %ifferent 0ay to most
programming te1tboo-s: rather than simpy present the finishe% so&tion, ' have
o&tine% the tho&ght pro!esses invove% in soving the e1er!ises an% sho0n ho0
the so&tion is b&it &p step/by/step. 7opef&y this approa!h 0i give yo& an
insight into the probem/soving min%set that programming re<&ires. 't.s probaby
a goo% i%ea to rea% thro&gh the so&tions even if yo& s&!!essf&y sove the e1er!ise
probems yo&rsef, as they sometimes s&ggest an approa!h that is not imme%iatey
obvio&s.
$etting in touch
?ne of the most !onvin!ing arg&ments for presenting a !o&rse i-e this one in the
form of an eboo- is that it !an be !ontin&ay &p%ate% an% t0ea-e% base% on rea%er
fee%ba!-. "o, if yo& fin% anything that is har% to &n%erstan%, or yo& thin- may
!ontain an error, pease get in to&!h 5 9&st %rop me an emai at
martin6pythonforbioogists.!om an% ' promise to get ba!- to yo&.
&etting up your environment
4 that yo& nee% in or%er to foo0 the e1ampes an% e1er!ises in this boo- is a
stan%ar% #ython instaation an% a te1t e%itor. 4 the !o%e in this boo- 0i r&n on
either +in&1, Ma! or ,in%o0s ma!hines. $he sight %ifferen!es bet0een operating
@ Chapter 1: 'ntro%&!tion an% environment
systems are e1paine% in the te1t :mosty in !hapter @=. 'f yo& have a !hoi!e of
operating systems on 0hi!h to earn #ython, ' re!ommen% +in&1, Ma! ?"J an%
,in%o0s in that or%er, simpy be!a&se the *;'J/base% operating systems :+in&1
an% ?"J= are more amenabe to programming in genera.
Installing P+t,on
$he pro!ess of instaing #ython %epen%s on the type of !omp&ter yo&.re r&nning
on. 'f yo&.re r&nning a mainstream +in&1 %istrib&tion i-e *b&nt&, #ython is
probaby area%y instae%. $o fin% o&t, open a termina an% type
python
'f yo& see some o&tp&t aong these ines:
Python 2.7.3 (default !pr "# 2#"3 #$:"3:"%&
'()) 4.7.2* on linux2
+ype ,help, ,copyright, ,credits, or ,license, for more information.
---
$hen yo& are rea%y to go. 'f yo&r +in&1 instaation %oesn.t area%y have #ython
instae%, try instaing it 0ith yo&r pa!-age manager :the !omman% 0i probaby
be either sudo apt-get install python or sudo yum install python=.
'f this %oesn.t 0or-, then %o0noa% the pa!-age from the #ython %o0noa% page
1
.
$he offi!ia #ython 0ebsite has instaation instr&!tions for Ma!
2
an% ,in%o0s
3

!omp&ters as 0eD these are i-ey to be the most &p/to/%ate instr&!tions, so foo0
them !osey.
1 http://000.python.org/getit/
2 http://000.python.org/getit/ma!/
3 http://000.python.org/getit/0in%o0s/
10 Chapter 1: 'ntro%&!tion an% environment
Running P+t,on programs
4 #ython program is 9&st a norma te1t fie that !ontains #ython !o%e. $o r&n it 0e
m&st first open &p a !omman% ine. ?n +in&1 an% Ma! !omp&ters, the appi!ation
to %o this 0i be !ae% something aong the ines of >termina>. ?n ,in%o0s, it is
-no0n as >!omman% prompt>.
$o r&n a #ython program, 0e 9&st type the path to the #ython e1e!&tabe foo0e%
by the name of the fie that !ontains the !o%e 0e 0ant to r&n
1
. ?n a +in&1 or Ma!
ma!hine, the path 0i be something i-e:
/usr/local/bin/python
?n ,in%o0s, it 0i be something i-e:
c:.Python27.python
$o r&n a #ython program, it.s generay easiest to be in the same fo%er as it. (y
!onvention, #ython programs are given the e1tension .py, so to r&n a program
!ae% test.py, 0e 9&st type:
/usr/local/bin/python test.py
$here are a !o&pe of tri!-s that !an be &sef& 0hen e1perimenting 0ith programs
2
.
Firsty, yo& !an r&n #ython in an intera!tive :or >she>= mo%e by r&nning it 0itho&t
the name of a program fie. $his ao0s yo& to type in%ivi%&a statements an% see
the res&t straight a0ay.
1 ,hen 0e refer to >a #ython program> in this boo-, 0e are &s&ay ta-ing abo&t the te1t fie that ho%s the
!o%e.
2 Don.t 0orry if these t0o options ma-e no sense to yo& right no0 5 they 0i %o so ater on in the boo-, on!e
yo&.ve earne% 0hat statements an% variabes a!t&ay are.
11 Chapter 1: 'ntro%&!tion an% environment
"e!on%y, yo& !an r&n #ython 0ith the -i option, 0hi!h 0i !a&se it to r&n yo&r
program an% t,en enter intera!tive mo%e. $his !an be han%y if yo& 0ant to
e1amine the state of variabes after yo&r !o%e has r&n.
P+t,on 2 vs- P+t,on 3
4s 0i <&i!-y be!ome !ear if yo& spen% any amo&nt of time on the offi!ia #ython
0ebsite, there are t0o versions of #ython !&rrenty avaiabe. $he #ython 0or% is,
at the time of 0riting, in the mi%%e of a transition from version 2 to version 3. 4
%is!&ssion of the pros an% !ons of ea!h version is 0e beyon% the s!ope of this
boo-
1
, b&t here.s 0hat yo& nee% to -no0: insta #ython 3 if possibe, b&t if yo& en%
&p 0ith #ython 2, %on.t 0orry 5 a the !o%e e1ampes in the boo- 0i 0or- 0ith
both versions.
'f yo&.re going to &se #ython 2, there is 9&st one thing that yo& have to %o in or%er
to ma-e some of the !o%e e1ampes 0or-: in!&%e this ine at the start of a yo&r
programs:
from //future// import di0ision
,e 0on.t go into the e1panation behin% this ine, e1!ept to say that it.s ne!essary
in or%er to !orre!t a sma <&ir- 0ith the 0ay that #ython 2 han%es %ivision of
n&mbers.
Depen%ing on 0hat version yo& &se, yo& might see sight %ifferen!es bet0een the
o&tp&t in this boo- an% the o&tp&t yo& get 0hen yo& r&n the !o%e on yo&r
!omp&ter. '.ve trie% to note these %ifferen!es in the te1t 0here possibe.
1 3o& might en!o&nter 0riting onine that ma-es the 2 to 3 !hangeover seem i-e a big %ea, an% it is 5 b&t
ony for e1isting, arge pro9e!ts. ,hen 0riting !o%e from s!rat!h, as yo&. be %oing 0hen earning, yo&.re
&ni-ey to r&n into any probems.
12 Chapter 1: 'ntro%&!tion an% environment
'e!t e"itors
"in!e a #ython program is 9&st a te1t fie, yo& !an !reate an% e%it it 0ith any te1t
e%itor of yo&r !hoi!e. ;ote that by a te1t e%itor ' don.t mean a 0or% pro!essor 5 %o
not try to e%it #ython programs 0ith Mi!rosoft ,or%, +ibre?ffi!e ,riter, or simiar
toos, as they ten% to insert spe!ia formatting mar-s that #ython !annot rea%.
,hen !hoosing a te1t e%itor, there is one feat&re that is essentia
1
to have, an% one
0hi!h is ni!e to have. $he essentia feat&re is something that.s &s&ay !ae% tab
emulation. $he effe!t of this feat&re at first seems <&ite o%%D 0hen enabe%, it
repa!es any tab !hara!ters that yo& type 0ith an e<&ivaent n&mber of spa!e
!hara!ters :&s&ay set to fo&r=. $he reason 0hy this is &sef& is %is!&sse% at ength
in !hapter B, b&t here.s a brief e1panation: #ython is very f&ssy abo&t yo&r &se of
tabs an% spa!es, an% &ness yo& are very %is!ipine% 0hen typing, it.s easy to en% &p
0ith a mi1t&re of tabs an% spa!es in yo&r programs. $his !a&ses very inf&riating
probems, be!a&se they oo- the same to yo&, b&t not to #ython8 $ab em&ation
fi1es the probem by ma-ing it effe!tivey impossibe for yo& to type a tab
!hara!ter.
$he feat&re that is ni!e to have is synta! highlighting. $his 0i appy %ifferent
!oo&rs to %ifferent parts of yo&r #ython !o%e, an% !an hep yo& spot errors more
easiy.
2e!ommen%e% te1t e%itors are /otepad00 for ,in%o0s
2
, Text$rangler for Ma!
?"J
3
, an% gedit for +in&1
B
, a of 0hi!h are freey avaiabe.
?n the 0eb an% ese0here yo& may see referen!es to #ython 'D)s. 'D) stan%s for
'ntegrate% Deveopment )nvironment, an% they typi!ay !ombine a te1t e%itor
0ith a !oe!tion of other &sef& programming toos. ,hie they !an spee% &p
1 ?K, so it.s not stri!ty essentia, b&t yo& 0i fin% ife m&!h easer if yo& have it.
2 http://notepa%/p&s/p&s.org/
3 http://000.barebones.!om/pro%&!ts/$e1t,ranger/
B https://pro9e!ts.gnome.org/ge%it/
13 Chapter 1: 'ntro%&!tion an% environment
%eveopment for e1perien!e% programmers, they.re not a goo% i%ea for beginners as
they !ompi!ate things, so ' %on.t re!ommen% yo& &se them.
(ea"ing the "ocumentation
#art of the tea!hing phiosophy that '.ve &se% in 0riting this boo- is that it.s better
to intro%&!e a fe0 &sef& feat&res an% f&n!tions rather than over0hem yo& 0ith a
!omprehensive ist. $he best pa!e to go 0hen yo& %o 0ant a !ompete ist of the
options avaiabe in #ython is the offi!ia %o!&mentation
1
0hi!h, !ompare% to
many ang&ages, is very rea%abe.
1 http://000.python.org/%o!/
1B Chapter 2: #rinting an% manip&ating te1t
2: Printing and manipulating text
Why are we so intereste" in working with te!t?
?pen the first page of a boo- abo&t earning #ython
1
, an% the !han!es are that the
first e1ampes of !o%e yo&. see invove numbers. $here.s a goo% reason for that:
n&mbers are generay simper to 0or- 0ith than te1t 5 there are not too many
things yo& !an %o 0ith them :on!e yo&.ve got basi! arithmeti! o&t of the 0ay= an%
so they en% themseves 0e to e1ampes that are easy to &n%erstan%. 't.s aso a
pretty safe bet that the average person rea%ing a programming boo- is %oing so
be!a&se they nee% to %o some n&mber/!r&n!hing.
"o 0hat ma-es this boo- %ifferent 5 0hy is this first !hapter abo&t te1t rather than
n&mbersE $he ans0er is that, as bioogists, 0e have a parti!&ar interest in %eaing
0ith te1t rather than n&mbers :tho&gh of !o&rse, 0e. nee% to earn ho0 to
manip&ate n&mbers too=. "pe!ifi!ay, 0e.re intereste% in parti!&ar types of te1t
that 0e !a se;uences < the D;4 an% protein se<&en!es that !onstit&te the %ata
that 0e %ea 0ith in bioogy.
$here are other reasons that 0e have a greater interest in 0or-ing 0ith te1t than
the average novi!e programmer. 4s s!ientists, the programs that 0e 0rite often
nee% to 0or- as part of a pipeine, aongsi%e other programs that have been 0ritten
by other peope. $o %o this, 0e. often nee% to 0rite !o%e that !an understand the
o&tp&t from some other program :0e !a this parsing= or produce o&tp&t in a
format that another program !an operate on. (oth of these tas-s re<&ire
manip&ating te1t.
'.ve hinte% above that !omp&ters !onsi%er n&mbers an% te1t to be %ifferent in some
0ay. $hat.s an important i%ea, an% one that 0e. ret&rn to in more %etai ater. For
no0, ' 0ant to intro%&!e an important pie!e of 9argon 5 the 0or% string. "tring is
1 ?r in%ee%, any other programming ang&age
1C Chapter 2: #rinting an% manip&ating te1t
the 0or% 0e &se to refer to a bit of te1t in a !omp&ter program :it 9&st means a
string of !hara!ters=. From this point on 0e. &se the 0or% string 0hen 0e.re ta-ing
abo&t !omp&ter !o%e, an% 0e. reserve the 0or% se;uence for 0hen 0e.re %is!&ssing
bioogi!a se<&en!es i-e D;4 an% protein.
#rinting a message to the s!reen
$he first thing 0e.re going to earn is ho0 to print
1
a message to the s!reen. 7ere.s a
ine of #ython !o%e that 0i !a&se a frien%y message to be printe%. L&i!-
remin%er: soi% ines in%i!ate #ython !o%e, %otte% ines in%i!ate o&tp&t.
print(,1ello 2orld,&
+et.s ta-e a oo- at the vario&s bits of this ine of !o%e, an% give some of them
names:
$he 0hoe ine is !ae% a statement.
print is the name of a function= $he f&n!tion tes #ython, in vag&e terms, 0hat 0e
0ant to %o 5 in this !ase, 0e 0ant to print some te1t. $he f&n!tion name is a0ays
2

foo0e% by parentheses
3
.
$he bits of te1t insi%e the parentheses are !ae% the arguments to the f&n!tion. 'n
this !ase, 0e 9&st have one arg&ment :ater on 0e. see e1ampes of f&n!tions that
ta-e more than one arg&ment, in 0hi!h !ase the arg&ments are separate% by
!ommas=.
1 ,hen 0e ta- abo&t printing te1t insi%e a !omp&ter program, 0e are not ta-ing abo&t pro%&!ing a
%o!&ment on a printer. $he 0or% >print> is &se% for any o!!asion 0hen o&r program o&tp&ts some te1t 5 in
this !ase, the o&tp&t is %ispaye% in yo&r termina.
2 $his is not stri!ty tr&e, b&t it.s easier to 9&st foo0 this r&e than 0orry abo&t the e1!eptions.
3 $here are severa %ifferent types of bra!-ets in #ython, so for !arity 0e 0i a0ays refer to parentheses
0hen 0e mean these: :=, s;uare brackets 0hen 0e mean these: MN an% curly brackets 0hen 0e mean these: OP
1F Chapter 2: #rinting an% manip&ating te1t
$he arg&ments te #ython 0hat 0e 0ant to %o more spe!ifi!ay 5 in this !ase, the
arg&ment tes #ython e1a!ty 0hat it is 0e 0ant to print: a frien%y greeting.
4ss&ming yo&.ve foo0e% the instr&!tions in !hapter 1 an% set &p yo&r #ython
environment, type the ine of !o%e above into yo&r favo&rite te1t e%itor, save it, an%
r&n it. 3o& sho&% see a singe ine of o&tp&t i-e this:
1ello 2orld
L&otes are important
'n norma 0riting, 0e ony s&rro&n% a bit of te1t in <&otes 0hen 0e 0ant to sho0
that they are being sai% by somebo%y. 'n #ython, ho0ever, strings are alwa+s
s&rro&n%e% by <&otes. $hat is ho0 #ython is abe to te the %ifferen!e bet0een the
instr&!tions :i-e the f&n!tion name= an% the %ata :the thing 0e 0ant to print=. ,e
!an &se either singe or %o&be <&otes for strings 5 #ython 0i happiy a!!ept
either. $he foo0ing t0o statements behave e1a!ty the same:
print(,1ello 2orld,&
print(31ello 2orld3&
+et.s ta-e a oo- at the o&tp&t to prove it
1
:
1ello 2orld
1ello 2orld
3o&. noti!e that the o&tp&t above %oesn.t !ontain <&otes 5 they are part of the
!o%e, not part of the string itsef. 'f 0e do 0ant to in!&%e <&otes in the o&tp&t, the
easiest thing to %o
2
is &se the other type of <&otes for s&rro&n%ing the string:
1 From this point on, ' 0on.t te yo& to !reate a ne0 fie, enter the te1t, an% r&n the program for ea!h
e1ampe 5 ' 0i simpy sho0 yo& the o&tp&t 5 b&t ' en!o&rage yo& to try the e1ampes yo&rsef.
2 $he aternative is to pa!e a ba!-sash !hara!ter :Q= before the <&ote 5 this is !ae% escaping the <&ote an%
1H Chapter 2: #rinting an% manip&ating te1t
print(,She said 31ello 2orld3,&
print(31e said ,1ello 2orld,3&
$he above !o%e 0i give the foo0ing o&tp&t:
She said 31ello 2orld3
1e said ,1ello 2orld,
(e !aref& 0hen 0riting an% rea%ing !o%e that invoves <&otes 5 yo& have to ma-e
s&re that the <&otes at the beginning an% en% of the string mat!h &p.
*se !omments to annotate yo&r !o%e
?!!asionay, 0e 0ant to 0rite some te1t in a program that is for h&mans to rea%,
rather than for the !omp&ter to e1e!&te. ,e !a this type of ine a comment. $o
in!&%e a !omment in yo&r so&r!e !o%e, start the ine 0ith a hash symbo
1
:
4 this is a comment it 2ill be ignored by the computer
print(,)omments are 0ery useful5,&
3o&.re going to see a ot of !omments in the so&r!e !o%e e1ampes in this boo-, an%
aso in the so&tions to the e1er!ises. Comments are a very &sef& 0ay to %o!&ment
yo&r !o%e, for a n&mber of reasons:
• 3o& !an p&t the e1panation of 0hat a parti!&ar bit of !o%e %oes right ne1t
to the !o%e itsef. $his ma-es it m&!h easier to fin% the %o!&mentation for a
ine of !o%e that is in the mi%%e of a arge program, 0itho&t having to
sear!h thro&gh a separate %o!&ment.
0i prevent #ython from trying to interpret it.
1 $his symbo has many names 5 yo& might -no0 it as n&mber sign, po&n% sign, o!tothorpe, sharp :from
m&si!a notation=, !ross, or pig/pen.
1I Chapter 2: #rinting an% manip&ating te1t
• (e!a&se the !omments are part of the so&r!e !o%e, they !an never get mi1e%
&p or separate%. 'n other 0or%s, if yo& are oo-ing at the so&r!e !o%e for a
parti!&ar program, then yo& a&tomati!ay have the %o!&mentation as 0e.
'n !ontrast, if yo& -eep the %o!&mentation in a separate fie, it !an easiy
be!ome separate% from the !o%e.
• 7aving the !omments right ne1t to the !o%e a!ts as a remin%er to &p%ate the
%o!&mentation 0henever yo& !hange the !o%e. $he ony thing 0orse than
&n%o!&mente% !o%e is !o%e 0ith o% %o!&mentation that is no onger
a!!&rate8
Don.t ma-e the mista-e, by the 0ay, of thin-ing that !omments are ony &sef& if
yo& are panning on sho0ing yo&r !o%e to somebo%y ese. ,hen yo& start 0riting
yo&r o0n !o%e, yo& 0i be amaGe% at ho0 <&i!-y yo& forget the p&rpose of a
parti!&ar se!tion or statement. 'f yo& are 0or-ing on a so&tion to one of the
e1er!ises in this boo- on Fri%ay afternoon, then !ome ba!- to it on Mon%ay
morning, it 0i probaby ta-e yo& <&ite a 0hie to pi!- &p 0here yo& eft off.
Comments !an hep 0ith this probem by giving yo& hints abo&t the p&rpose of
!o%e, meaning that yo& spen% ess time trying to &n%erstan% yo&r o% !o%e, th&s
spee%ing &p yo&r progress. 4 si%e benefit is that 0riting a !omment for a bit of !o%e
reinfor!es yo&r &n%erstan%ing at the time yo& are %oing it. 4 goo% habit to get into
is 0riting a <&i!- one/ine !omment above any ine of !o%e that %oes something
interesting:
4 print a friendly greeting
print(,1ello 2orld,&
3o&. see this te!hni<&e &se% a ot in the !o%e e1ampes in this boo-, an% '
en!o&rage yo& to &se it for yo&r o0n !o%e as 0e. $here are other 0ays to &se
!omments 0hi!h 0or- 0ith #ython.s b&it/in hep system 5 ta-e a oo- at the
!hapter on mo%&es an% testing in A"vance" Python for 8iologists.
1@ Chapter 2: #rinting an% manip&ating te1t
)rror messages an% %eb&gging
't may seem %epressing eary in the boo- to be ta-ing abo&t errors8 7o0ever, it.s
0orth pointing o&t at this eary stage that computer programs almost never
wor1 correctl+ t,e first time. #rogramming ang&ages are not i-e nat&ra
ang&ages 5 they have a very stri!t set of r&es, an% if yo& brea- any of them, the
!omp&ter 0i not attempt to g&ess 0hat yo& inten%e%, b&t instea% 0i stop
r&nning an% present yo& 0ith an error message. 3o&.re going to be seeing a ot of
these error messages in yo&r programming !areer, so et.s get &se% to them as soon
as possibe.
)orgetting 2uotes
7ere.s one possibe error 0e !an ma-e 0hen printing a ine of o&tp&t 5 0e !an
forget to in!&%e the <&otes:
print(1ello 2orld&
$his is easiy %one, so et.s ta-e a oo- at the o&tp&t 0e. get if 0e try to r&n the
above !o%e
1
:
6 python error.py
7ile ,error.py, line "
print(1ello 2orld&
8
Syntax9rror: in0alid syntax
1 $he o&tp&t that yo& see might be very sighty %ifferent from this, %epen%ing on a b&n!h of fa!tors i-e
yo&r operating system an% the e1a!t version of #ython yo& are &sing.
1
2
3
4
5
20 Chapter 2: #rinting an% manip&ating te1t
2eferring to the ine n&mbers on the eft 0e !an see that the name of the #ython
fie is error.py :ine 1= an% that the error o!!&rs on the first ine of the fie :ine
2=. #ython.s best g&ess at the o!ation of the error is 9&st before the !ose
parentheses :ine 3=. Depen%ing on the type of error, this !an be 0rong by <&ite a
bit, so %on.t rey on it too m&!h8
$he type of error is a SyntaxError :ine C=, 0hi!h mean that #ython !an.t
&n%erstan% the !o%e 5 it brea-s the r&es in some 0ay :in this !ase, the r&e that
strings m&st be s&rro&n%e% by <&otation mar-s=. ,e. see %ifferent types of errors
ater in this boo-. For a %is!&ssion of ho0 these errors are a!t&ay generate%, an%
ho0 0e !an %ea 0ith them, see the !hapter on e1!eptions in A "vance" Python for
8iologists.
3pelling mista1es
,hat happens if 0e miss/spe the name of the f&n!tionE:
prin(,1ello 2orld,&
,e get a %ifferent type of error 5 a NameError – an% the error message is a bit
more hepf&:
6 python error.py
+racebac: (most recent call last&:
7ile ,error.py, line " in ;module-
prin(,1ello 2orld,&
<ame9rror: name 3prin3 is not defined
1
2
3
4
5
21 Chapter 2: #rinting an% manip&ating te1t
$his time, #ython %oesn.t try to sho0 &s 0here on the ine the error o!!&rre%, it
9&st sho0s &s the 0hoe ine :ine B=. $he error message tes &s 0hi!h 0or% #ython
%oesn.t &n%erstan% :ine C=, so in this !ase, it.s <&ite easy to fi1.
3plitting a statement over two lines
,hat if 0e 0ant to print some o&tp&t that spans m&tipe inesE For e1ampe, 0e
0ant to print the 0or% >7eo> on one ine an% then the 0or% >,or%> on the ne1t
ine 5 i-e this:
1ello
=orld
,e might try p&tting a ne0 ine in the mi%%e of o&r string i-e this:
print(,1ello
=orld,&
b&t that 0on.t 0or- an% 0e. get the foo0ing error message:
6 python error.py
7ile ,error.py, line "
print(,1ello
8
Syntax9rror: 9>? 2hile scanning string literal
#ython fin%s the error 0hen it gets to the en% of the first ine of !o%e :ine 2 in the
o&tp&t=. $he error message :ine C= is a bit more !rypti! than the others. >6 stan%s
for )n% ?f +ine, an% string literal means a string in <&otes. "o to p&t this error
message in pain )ngish: :5 starte" rea"ing a string in ;uotes0 an" 5 got to the en" of
the line before 5 came to the closing ;uotation mark:
'f spitting the ine &p %oesn.t 0or-, then ho0 %o 0e get the o&tp&t 0e 0ant.....E
1
2
3
4
5
22 Chapter 2: #rinting an% manip&ating te1t
Printing special characters
$he reason that the !o%e above %i%n.t 0or- is that #ython got !onf&se% abo&t
0hether the ne0 ine 0as part of the string :0hi!h is 0hat 0e 0ante%= or part of the
source co"e :0hi!h is ho0 it 0as a!t&ay interprete%=. ,hat 0e nee% is a 0ay to
in!&%e a ne0 ine as part of a string, an% &!-iy for &s, #ython has 9&st s&!h a too
b&it in. $o in!&%e a ne0 ine, 0e 0rite a ba!-sash foo0e% by the etter n 5
#ython -no0s that this is a spe!ia !hara!ter an% 0i interpret it a!!or%ingy.
7ere.s the !o%e 0hi!h prints >7eo 0or%> a!ross t0o ines:
4 ho2 to include a ne2 line in the middle of a string
print(,1ello.n2orld,&
;oti!e that there.s no nee% for a spa!e before or after the ne0 ine.
$here are a fe0 other &sef& spe!ia !hara!ters as 0e, a of 0hi!h !onsist of a
ba!-sash foo0e% by a etter. $he ony ones 0hi!h yo& are i-ey to nee% for the
e1er!ises in this boo- are the tab !hara!ter :\t= an% the carriage return !hara!ter
:\r=. $he tab !hara!ter !an sometimes be &sef& 0hen 0riting a program that 0i
pro%&!e a ot of o&tp&t. $he !arriage ret&rn !hara!ter 0or-s a bit i-e a ne0 ine in
that it p&ts the !&rsor ba!- to the start of the ine, b&t %oesn.t a!t&ay start a ne0
ine, so yo& !an &se it to over0rite o&tp&t 5 this is sometimes &sef& for ong/
r&nning programs.
&toring strings in variables
?K, 0e.ve been paying aro&n% 0ith the print f&n!tion for a 0hieD et.s intro%&!e
something ne0. ,e !an ta-e a string an% assign a name to it &sing an e<&as sign 5
0e !a this a variable:
4 store a short @<! seAuence in the 0ariable my/dna
my/dna B ,!+()(+!,
23 Chapter 2: #rinting an% manip&ating te1t
$he variabe my_dna no0 points to the string !"#C#"!. ,e !a this assigning a
variabe, an% on!e 0e.ve %one it, 0e !an &se the variabe name instea% of the string
itsef 5 for e1ampe, 0e !an &se it in a print statement
1
:
4 store a short @<! seAuence in the 0ariable my/dna
my/dna B ,!+()(+!,
4 no2 print the @<! seAuence
print(my/dna&
;oti!e that 0hen 0e &se the variabe in a print statement, 0e %on.t nee% any
<&otation mar-s 5 the <&otes are part of the string, so they are area%y >b&it in> to
the variabe my_dna.
,e !an !hange the va&e of a variabe as many times as 0e i-e on!e 0e.ve !reate%
it:
my/dna B ,!+()(+!,
print(my/dna&
4 change the 0alue of my/dna
my/dna B ,+((+))!,
7ere.s a very important point that trips many beginners &p: variabe names are
arbitrar+ 5 that means that 0e !an pi!- w,atever we li1e to be the name of a
variabe. "o o&r !o%e above 0o&% 0or- in e1a!ty the same 0ay if 0e pi!-e% a
%ifferent variabe name:
4 store a short @<! seAuence in the 0ariable banana
banana B ,!+()(+!,
4 no2 print the @<! seAuence
print(banana&
1 'f it.s not !ear 0hy this is &sef&, %on.t 0orry 5 it 0i be!ome m&!h more apparent 0hen 0e oo- at some
onger e1ampes.
2B Chapter 2: #rinting an% manip&ating te1t
,hat ma-es a goo% variabe nameE Reneray, it.s a goo% i%ea to &se a variabe
name that gives &s a !&e as to 0hat the variabe refers to. 'n this e1ampe, my_dna
is a goo% variabe name, be!a&se it tes &s that the !ontent of the variabe is a D;4
se<&en!e. Conversey, $anana is a ba% variabe name, be!a&se it %oesn.t reay te
&s anything abo&t the va&e that.s store%. 4s yo& rea% thro&gh the !o%e e1ampes
in this boo-, yo&. get a better i%ea of 0hat !onstit&tes goo% an% ba% variabe
names.
$his i%ea 5 that names for things are arbitrary, an% !an be anything 0e i-e 5 is a
theme that 0i o!!&r many times in this boo-, so it.s important to -eep it in min%.
?!!asionay yo& 0i see a variabe name that loo1s li1e it has some sort of
reationship 0ith the va&e it points to:
my/file B ,my/file.txt,
b&t %on.t be fooe%8 Sariabe names an% strings are separate things.
' sai% above that variabe names !an be anything 0e 0ant, b&t it.s a!t&ay not <&ite
that simpe 5 there are some r&es 0e have to foo0. ,e are ony ao0e% to &se
etters, n&mbers, an% &n%ers!ores, so 0e !an.t have variabe names that !ontain
o%% !hara!ters i-e T, U or A. ,e are not ao0e% to start a name 0ith a n&mber
:tho&gh 0e !an &se n&mbers in the mi%%e or at the en% of a name=. Finay, 0e
!an.t &se a 0or% that.s area%y b&it in to the #ython ang&age i-e >print>.
't.s aso important to remember that variabe names are !ase/sensitive, so
my_dna% &'_(N!% &y_(N! and &y_(na are a separate variabes.
$e!hni!ay this means that yo& !o&% &se a fo&r of those names in a #ython
program to store %ifferent va&es, b&t pease %on.t %o this 5 it is very easy to
be!ome !onf&se% 0hen yo& &se very simiar variabe names.
2C Chapter 2: #rinting an% manip&ating te1t
'ools for manipulating strings
;o0 0e -no0 ho0 to store an% print strings, 0e !an ta-e a oo- at a fe0 of the
fa!iities that #ython has for manip&ating them. 't.s a!t&ay possibe to e1pore
the toos for manip&ating parti!&ar types of %ata from 0ithin #ython itsef 5 see
the !hapter on mo%&es an% testing in A"vance" Python for 8iologists for a
%is!&ssion 5 b&t for no0 0e. 9&st ta-e a oo- at some of the most &sef& ones. 'n
the e1er!ises at the en% of this !hapter, 0e. oo- at ho0 0e !an &se m&tipe
%ifferent toos together in or%er to !arry o&t more !ompe1 operations.
Concatenation
,e !an !on!atenate :sti!- together= t0o strings &sing the V symbo
1
. $his symbo
0i 9oin together the string on the eft 0ith the string on the right:
my/dna B ,!!++, C ,(()),
print(my/dna&
+et.s ta-e a oo- at the o&tp&t:
!!++(())
'n the above e1ampe, the things being !on!atenate% 0ere strings, b&t 0e !an aso
&se variabes that point to strings:
upstream B ,!!!,
my/dna B upstream C ,!+(),
4 my/dna is no2 ,!!!!+(),
1 ,e !a this the concatenation operator=
2F Chapter 2: #rinting an% manip&ating te1t
,e !an even 9oin m&tipe strings together in one go:
upstream B ,!!!,
do2nstream B ,(((,
my/dna B upstream C ,!+(), C do2nstream
4 my/dna is no2 ,!!!!+()(((,
't.s important to reaiGe that the res&t of !on!atenating t0o strings together is
itsef a string. "o it.s perfe!ty ?K to &se a !on!atenation insi%e a print statement:
print(,1ello, C , , C ,2orld,&
4s 0e. see in the rest of the boo-, &sing one too insi%e another is <&ite a !ommon
thing to %o in #ython.
)inding t,e lengt, of a string
4nother &sef& b&it/in too in #ython is the len f&n!tion :len is short for ength=.
J&st i-e the print f&n!tion, the len f&n!tion ta-es a singe arg&ment :ta-e a
<&i!- oo- ba!- at 0hen 0e 0ere %is!&ssing the print f&n!tion for a remin%er
abo&t 0hat arg&ments are= 0hi!h is a string. 7o0ever, the behavio&r of the len
f&n!tion is <&ite %ifferent. 'nstea% of o&tp&tting te1t to the s!reen, len o&tp&ts a
va&e that !an be store% 5 0e !a this the return value. 'n other 0or%s, if 0e 0rite a
program that &ses len to !a!&ate the ength of a string, the program 0i r&n b&t
0e 0on.t see any o&tp&t:
4 this line doesn3t produce any output
len(,!+(),&
'f 0e 0ant to a!t&ay &se the ret&rn va&e, 0e nee% to store it in a variabe, an%
then %o something &sef& 0ith it :i-e printing it=:
2H Chapter 2: #rinting an% manip&ating te1t
dna/length B len(,!(+),&
print(dna/length&
$here.s another interesting thing abo&t the len f&n!tion: the res&t :or return
value= is not a string, it.s a n&mber. $his is a very important i%ea so '.m going to
0rite it o&t in bo%: P+t,on treats strings and numbers differentl+-
,e !an see that this is the !ase if 0e try to !on!atenate together a n&mber an% a
string. Consi%er this short program 0hi!h !a!&ates the ength of a D;4 se<&en!e
an% then prints a message teing &s the ength:
4 store the @<! seAuence in a 0ariable
my/dna B ,!+()(!(+,
4 calculate the length of the seAuence and store it in a 0ariable
dna/length B len(my/dna&
4 print a message telling us the @<! seAuence lenth
print(,+he length of the @<! seAuence is , C dna/length&
,hen 0e try to r&n this program, 0e get the foo0ing error:
6 python error.py
+racebac: (most recent call last&:
7ile ,error.py, line % in ;module-
print(,+he length of the @<! seAuence is , C dna/length&
+ype9rror: cannot concatenate 3str3 and 3int3 obDects
$he error message :ine C= is short b&t informative: >cannot concatenate
)str) and )int) o$*ects>. #ython is !ompaining that it %oesn.t -no0 ho0 to
!on!atenate a string :0hi!h it !as str for short= an% a n&mber :0hi!h it !as int
5 short for integer=. "trings an% n&mbers are e1ampes of types 5 %ifferent -in%s of
information that !an e1ist insi%e a program. 'f yo& 0ant to rea% more, there.s a f&
e1panation of ho0 types 0or- in the !hapter on ob9e!t/oriente% programming in
A"vance" Python for 8iologists.
1
2
3
4
5
2I Chapter 2: #rinting an% manip&ating te1t
7appiy, #ython has a b&it/in so&tion to this type mismat!h probem 5 a f&n!tion
!ae% str 0hi!h t&rns a n&mber
1
into a string so that 0e !an print it. 7ere.s ho0
0e !an mo%ify o&r program to &se it 5 '.ve remove% the !omments from this version
to ma-e it a bit more !ompa!t:
my/dna B ,!+()(!(+,
dna/length B len(my/dna&
print(,+he length of the @<! seAuence is , C str(dna/length&&
$he ony thing 0e have !hange% is that 0e.ve repa!e dna_length 0ith
str+dna_length, insi%e the print statement
2
. ;oti!e that be!a&se 0e.re &sing
one f&n!tion :str= insi%e another f&n!tion :print=, o&r statement no0 en%s 0ith
t0o !osing parentheses.
$o finish o&r %is!&ssion of the str f&n!tion, here.s a forma %es!ription of it, 0ith
a the te!hni!a terms in itai!s:
str is a function 0hi!h ta-es one argument :0hose type is number=, an% returns a
va&e :0hose type is string= representing that n&mber.
'f yo&.re &ns&re abo&t the meanings of any of the 0or%s in itai!s, s-ip ba!- to the
earier parts of this !hapter 0here 0e %is!&sse% them. *n%erstan%ing ho0 types
0or- is -ey to avoi%ing many of the fr&strations 0hi!h ne0 programmers typi!ay
en!o&nter, so ma-e s&re the i%ea is !ear in yo&r min% before moving on 0ith the
rest of this boo-.
C,anging case
,e !an !onvert a string to o0er !ase by &sing a ne0 type of synta1 5 a metho" that
beongs to strings. 4 metho" is i-e a function, b&t instea% of being b&it in to the
1 ?r a va&e of any non/string type, b&t 0e. !ome to that ater
2 'f yo& e1periment 0ith some of the !o%e here, yo& might %is!over that yo& !an aso print a n&mber %ire!ty
0itho&t &sing str 5 b&t ony if yo& %on.t try to !on!atenate it.
2@ Chapter 2: #rinting an% manip&ating te1t
#ython ang&age, it beongs to a parti!&ar type. $he metho% 0e are ta-ing abo&t
here is !ae% lo-er, an% 0e say that it beongs to the string type. 7ere.s ho0 0e
&se it:
my/dna B ,!+(),
4 print my/dna in lo2er case
print(my/dna.lo2er(&&
;oti!e ho0 &sing a metho% oo-s %ifferent to &sing a f&n!tion. ,hen 0e &se a
f&n!tion i-e print or len, 0e 0rite the f&n!tion name first an% the arg&ments go
in parentheses:
print(,!+(),&
len(my/dna&
,hen 0e &se a metho%, 0e 0rite the name of the variabe first, foo0e% by a
perio%, then the name of the metho%, then the metho% arg&ments in parentheses.
For the e1ampe 0e.re oo-ing at here, lo-er, there is no arg&ment, so the opening
an% !osing parentheses are right ne1t to ea!h other.
't.s important to noti!e that the lo-er metho% %oes not a!t&ay !hange the
variabeD instea% it ret&rns a !opy of the variabe in o0er !ase. ,e !an prove that it
0or-s this 0ay by printing the variabe before an% after r&nning lo-er. 7ere.s the
!o%e to %o so:
my/dna B ,!+(),
4 print the 0ariable
print(,before: , C my/dna&
4 run the lo2er method and store the result
lo2ercase/dna B my/dna.lo2er(&
4 print the 0ariable again
print(,after: , C my/dna&
an% here.s the o&tp&t 0e get:
30 Chapter 2: #rinting an% manip&ating te1t
before: !+()
after: !+()
J&st i-e the len f&n!tion, in or%er to a!t&ay %o anything &sef& 0ith the lo-er
metho%, 0e nee% to store the res&t :or print it right a0ay=.
(e!a&se the lo-er metho% beongs to the string type, 0e !an ony &se it on
variabes that are strings. 'f 0e try to &se it on a n&mber:
my/number B len(,!(+),&
4 my/number is 4
print(my/number.lo2er(&&
0e 0i get an error that oo-s i-e this:
!ttribute9rror: 3int3 obDect has no attribute 3lo2er3
$he error message is a bit !rypti!, b&t hopef&y yo& !an grasp the meaning:
something that is a n&mber :an int, or integer= %oes not have a lo-er metho%.
$his is a goo% e1ampe of the importan!e of types in #ython !o%e: we can onl+ use
met,ods on t,e t+pe t,at t,e+ belong to.
(efore 0e move on, et.s 9&st mention that there is another metho% that beongs to
the string type !ae% upper 5 yo& !an probaby g&ess 0hat it %oes8
Replacement
7ere.s another e1ampe of a &sef& metho% that beongs to the string type:
replace. replace is sighty %ifferent from anything 0e.ve seen before 5 it ta-es
t0o arg&ments :both strings= an% ret&rns a !opy of the variabe 0here a
o!!&rren!es of the first string are repa!e% by the se!on% string. $hat.s <&ite a ong/
0in%e% %es!ription, so here are a fe0 e1ampes to ma-e things !earer:
31 Chapter 2: #rinting an% manip&ating te1t
protein B ,0lspad:tn0,
4 replace 0aline 2ith tyrosine
print(protein.replace(,0, ,y,&&
4 2e can replace more than one character
print(protein.replace(,0ls, ,ymt,&&
4 the original 0ariable is not affected
print(protein&
4n% this is the o&tp&t 0e get:
ylspad:tny
ymtpad:tn0
0lspad:tn0
,e. ta-e a oo- at more toos for !arrying o&t string repa!ement in !hapter H.
4xtracting part of a string
,hat %o 0e %o if 0e have a ong string, b&t 0e ony 0ant a short portion of itE $his
is -no0n as ta-ing a substring, an% it has its o0n notation in #ython. $o get a
s&bstring, 0e foo0 the variabe name 0ith a pair of s<&are bra!-ets 0hi!h en!ose
a start an% stop position, separate% by a !oon. 4gain, this is probaby easier to
vis&aiGe 0ith a !o&pe of e1ampes 5 et.s re&se o&r protein se<&en!e from before:
protein B ,0lspad:tn0,
4 print positions three to fi0e
print(protein'3:$*&
4 positions start at Eero not one
print(protein'#:%*&
4 if 2e use a stop position beyond the end it3s the same as using the end
print(protein'#:%#*&
an% here.s the o&tp&t:
32 Chapter 2: #rinting an% manip&ating te1t
pa
0lspad
0lspad:tn0
$here are t0o important things to noti!e here. Firsty, 0e a!t&ay start !o&nting
from position Gero, rather than one 5 in other 0or%s, position 3 is a!t&ay the
fo&rth !hara!ter
1
. $his e1pains 0hy the first !hara!ter of the first ine of o&tp&t is
p an% not s as yo& might thin-. "e!on%y, the positions are inclusive at the start,
b&t exclusive at the stop. 'n other 0or%s, the e1pression protein./012 gives &s
everything starting at the fo&rth !hara!ter, an% stopping 9&st before the si1th
!hara!ter :i.e. !hara!ters fo&r an% five=.
'f 0e 9&st give a singe n&mber in the s<&are bra!-ets, 0e. 9&st get a singe
!hara!ter:
protein B ,0lspad:tn0,
first/residue B protein'#*
,e. earn a ot more abo&t this type of notation, an% 0hat 0e !an %o 0ith it, in
!hapter B.
Counting and finding substrings
4 very !ommon 9ob in bioogy is to !o&nt the n&mber of times some pattern o!!&rs
in a D;4 or protein se<&en!e. 'n !omp&ter programming terms, 0hat that
transates to is !o&nting the n&mber of times a substring o!!&rs in a string. $he
metho% that %oes the 9ob is !ae% count. 't ta-es a singe arg&ment 0hose type is
string, an% ret&rns the n&mber of times that the arg&ment is fo&n% in the variabe.
$he ret&rn type is a n&mber, so be !aref& abo&t ho0 yo& &se it8
1 $his seems very annoying 0hen yo& first en!o&nter it, b&t 0e. see ater 0hy it.s ne!essary.
33 Chapter 2: #rinting an% manip&ating te1t
+et.s &se o&r protein se<&en!e one ast time as an e1ampe. 2emember that 0e have
to &se o&r o% frien% str to t&rn the !o&nts into strings so that 0e !an print them.
4so, noti!e that here ' have &se% a ban- ine to separate o&t the t0o bits of the
program :!a!&ating the !o&nts, an% printing them=. #ython is perfe!ty happy
0ith this 5 it 9&st ignores ban- ines, so it.s fine to p&t them in in or%er to ma-e
yo&r programs more rea%abe for h&mans.
protein B ,0lspad:tn0,
4 count amino acid residues
0aline/count B protein.count(303&
lsp/count B protein.count(3lsp3&
tryptophan/count B protein.count(323&
4 no2 print the counts
print(,0alines: , C str(0aline/count&&
print(,lsp: , C str(lsp/count&&
print(,tryptophans: , C str(tryptophan/count&&
$he o&tp&t sho0s ho0 the !o&nt metho% behaves:
0alines: 2
leucines: "
tryptophans: #
4 !osey/reate% probem to !o&nting s&bstrings is fin%ing their o!ation. ,hat if
instea% of !o&nting the n&mber of proine resi%&es in o&r protein se<&en!e 0e
0ant to -no0 0here they areE $he find metho% 0i give &s the ans0er, at east for
simpe !ases. find ta-es a singe string arg&ment, 9&st i-e count, an% ret&rns a
n&mber 0hi!h is the position at 0hi!h that s&bstring first appears in the string :in
!omp&ting, 0e !a that the in"e! of the s&bstring=.
3B Chapter 2: #rinting an% manip&ating te1t
2emember that in #ython 0e start !o&nting from Gero rather than one, so position
0 is the first !hara!ter, position B is the fifth !hara!ter, et!. 4 !o&pe of e1ampes:
protein B ,0lspad:tn0,
print(str(protein.find(3p3&&&
print(str(protein.find(3:t3&&&
print(str(protein.find(323&&&
4n% the o&tp&t:
3
%
F"
;oti!e the behavio&r of fin% 0hen 0e as- it to o!ate a s&bstring that %oesn.t e1ist
5 0e get ba!- the ans0er -3.
(oth count an% find have a pretty serio&s imitation: yo& !an ony sear!h for
e1a!t s&bstrings. 'f yo& nee% to !o&nt the n&mber of o!!&rren!es of a variabe
protein motif, or fin% the position of a variabe trans!ription fa!tor bin%ing site,
they 0i not hep yo&. $he 0hoe of !hapter H is %evote% to toos that !an %o those
-in%s of 9obs.
?f the toos 0e.ve %is!&sse% in this se!tion, three 5 replace, count an% find 5
re<&ire at east t0o strings to 0or-, so be !aref& that yo& %on.t get !onf&se% abo&t
the or%er 5 remember that:
my_dna.count+my_motif,
is not the same as:
my/motif.count(my/dna&
3C Chapter 2: #rinting an% manip&ating te1t
"pitting &p a string into m&tipe bits
4n obvio&s <&estion 0hi!h bioogists often as- 0hen earning to program is >ho0
%o 0e spit a string :e.g. a D;4 se<&en!e= into m&tipe pie!esE> $hat.s a !ommon
9ob in bioogy, b&t &nfort&natey 0e !an.t %o it yet &sing the toos from this !hapter.
,e. ta- abo&t vario&s %ifferent 0ays of spitting strings in !hapter B. ' mention it
here 9&st to reass&re yo& that 0e 0i earn ho0 to %o it event&ay8
(ecap
,e starte% this !hapter ta-ing abo&t strings an% ho0 to 0or- 0ith them, b&t aong
the 0ay 0e ha% to ta-e a ot of %iversions, a of 0hi!h 0ere ne!essary to
&n%erstan% ho0 the %ifferent string toos 0or-. $han-f&y, that means that 0e.ve
!overe% most of the n&ts an% bots of the #ython ang&age, 0hi!h 0i ma-e f&t&re
!hapters go m&!h more smoothy.
,e.ve earne% abo&t some genera feat&res of the #ython programming ang&age
i-e
• the %ifferen!e bet0een functions, statements an% arguments
• the importan!e of comments an% ho0 to &se them
• ho0 to &se #ython.s error messages to fi1 b&gs in o&r programs
• ho0 to store values in variables
• the 0ay that types 0or-, an% the importan!e of &n%erstan%ing them
• the %ifferen!e bet0een functions an% metho"s, an% ho0 to &se them both
4n% 0e.ve en!o&ntere% some toos that are spe!ifi!ay for 0or-ing 0ith strings:
• !on!atenation
• %ifferent types of <&otes an% ho0 to &se them
• spe!ia !hara!ters
3F Chapter 2: #rinting an% manip&ating te1t
• !hanging the !ase of a string
• fin%ing an% !o&nting s&bstrings
• repa!ing bits of a string 0ith something ne0
• e1tra!ting bits of a string to ma-e a ne0 string
Many of the above topi!s 0i !rop &p again in f&t&re !hapters, an% 0i be
%is!&sse% in more %etai, b&t yo& !an a0ays ret&rn to this !hapter if yo& 0ant to
br&sh &p on the basi!s. $he e1er!ises for this !hapter 0i ao0 yo& to pra!ti!e
&sing the string manip&ation toos an% to be!ome famiiar 0ith them. $hey. aso
give yo& the !han!e to pra!ti!e b&i%er bigger programs by &sing the in%ivi%&a
toos as b&i%ing bo!-s.
3H Chapter 2: #rinting an% manip&ating te1t
!ercises
Reminder: the %es!riptions of the e1er!ises are %eiberatey terse an% may be
some0hat ambig&o&s :9&st i-e re<&irements for programs yo& 0i 0rite in rea
ife=. "ee the so&tions for in/%epth %is!&ssions of the e1er!ises.
Calculating 5T content
7ere.s a short D;4 se<&en!e:
!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+
,rite a program that 0i print o&t the 4$ !ontent of this D;4 se<&en!e. 7int: yo&
!an &se norma mathemati!a symbos i-e a%% :V=, s&btra!t :/=, m&tipy :W=, %ivi%e
:/= an% parentheses to !arry o&t !a!&ations on n&mbers in #ython.
Reminder: if yo&.re &sing #ython 2 rather than #ython 3, in!&%e this ine at the
top of yo&r program:
from //future// import di0ision
Complementing (/5
7ere.s a short D;4 se<&en!e:
!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+
,rite a program that 0i print the !ompement of this se<&en!e.
3I Chapter 2: #rinting an% manip&ating te1t
Restriction fragment lengt,s
7ere.s a short D;4 se<&en!e:
!)+(!+)(!++!)(+!+!(+!(!!++)+!+)!+!)!+!+!+!+)(!+()(++)!+
$he se<&en!e !ontains a re!ognition site for the )!o2' restri!tion enGyme, 0hi!h
!&ts at the motif RW44$$C :the position of the !&t is in%i!ate% by an asteris-=.
,rite a program 0hi!h 0i !a!&ate the siGe of the t0o fragments that 0i be
pro%&!e% 0hen the D;4 se<&en!e is %igeste% 0ith )!o2'.
3plicing out introns* part one
7ere.s a short se!tion of genomi! D;4:
4$CR4$CR4$CR4$CR4C$R4C$4R$C4$4RC$4$RC4$R$4RC$4C$CR4$CR4$CR4
$CR4$CR4$CR4$CR4$CR4$CR4$C4$RC$4$C4$CR4$CR4$4$CR4$RC4$CR4C$
4C$4$
't !omprises t0o e1ons an% an intron. $he first e1on r&ns from the start of the
se<&en!e to the si1ty/thir% !hara!ter, an% the se!on% e1on r&ns from the ninety/
first !hara!ter to the en% of the se<&en!e. ,rite a program that 0i print 9&st the
!o%ing regions of the D;4 se<&en!e.
3plicing out introns* part two
*sing the %ata from part one, 0rite a program that 0i !a!&ate 0hat per!entage
of the D;4 se<&en!e is !o%ing.
Reminder: if yo&.re &sing #ython 2 rather than #ython 3, in!&%e this ine at the
top of yo&r program:
from //future// import di0ision
3@ Chapter 2: #rinting an% manip&ating te1t
3plicing out introns* part t,ree
*sing the %ata from part one, 0rite a program that 0i print o&t the origina
genomi! D;4 se<&en!e 0ith !o%ing bases in &pper!ase an% non/!o%ing bases in
o0er!ase.
B0 Chapter 2: #rinting an% manip&ating te1t
&olutions
Calculating 5T content
$his e1er!ise is going to invove a mi1t&re of strings an% n&mbers. +et.s remin%
o&rseves of the form&a for !a!&ating 4$ !ontent:
AT content=
A+T
length
$here are three n&mbers 0e nee% to fig&re o&t: the n&mber of !s, the n&mber of "s,
an% the ength of the se<&en!e. ,e -no0 that 0e !an get the ength of the
se<&en!e &sing the len f&n!tion, an% 0e !an !o&nt the n&mber of !s an% "s &sing
the count metho%. 7ere are a fe0 ines of !o%e that 0e thin- 0i !a!&ate the
n&mbers 0e nee%:
my/dna B ,!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+,
length B len(my/dna&
a/count B my/dna.count(3!3&
t/count B my/dna.count(3+3&
4t this point, it seems sensibe to !he!- these ines before 0e go any f&rther. "o
rather than %iving straight in an% %oing some !a!&ations, et.s print o&t these
n&mbers so that 0e !an eyeba them an% see if they oo- appro1imatey right. ,e.
have to remember to t&rn the n&mbers into strings &sing str so that 0e !an print
them:
B1 Chapter 2: #rinting an% manip&ating te1t
my/dna B ,!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+,
length B len(my/dna&
a/count B my/dna.count(3!3&
t/count B my/dna.count(3+3&
print(,length: , C str(length&&
print(,! count: , C str(a/count&&
print(,+ count: , C str(t/count&&
+et.s ta-e a oo- at the o&tp&t from this program:
length: $4
! count: "%
+ count: 2"
$hat oo-s abo&t right, b&t ho0 %o 0e -no0 if it.s e1a!ty rightE ,e !o&% go
thro&gh the se<&en!e man&ay base by base, an% verify that there are si1teen 4s
an% eighteen $s, b&t that %oesn.t seem i-e a great &se of o&r time: aso, 0hat
0o&% 0e %o if the se<&en!e 0ere C1 -iobases rather than C1 basesE 4 better i%ea is
to r&n the e1a!t same !o%e 0ith a m&!h shorter test se<&en!e, to verify that it
0or-s before going ahea% an% r&nning it on the arger se<&en!e.
7ere.s a version that &ses a very short test se<&en!e 0ith one of ea!h of the fo&r
bases:
test/dna B ,!+(),
length B len(test/dna&
a/count B test/dna.count(3!3&
t/count B test/dna.count(3+3&
print(,length: , C str(length&&
print(,! count: , C str(a/count&&
print(,+ count: , C str(t/count&&
an% here.s the o&tp&t:
B2 Chapter 2: #rinting an% manip&ating te1t
length: 4
! count: "
+ count: "
)verything oo-s ?K 5 0e !an probaby go ahea% an% r&n the !o%e on the ong
se<&en!e. (&t 0aitD 0e -no0 that the ne1t step is going to invove %oing some
!a!&ations &sing the n&mbers. 'f 0e s0it!h ba!- to the ong se<&en!e no0, then
0e. be in the same position as 0e 0ere before 5 0e. en% &p 0ith an ans0er for
the 4$ !ontent, b&t 0e 0on.t -no0 if it.s the right one.
4 better pan is to sti!- 0ith the short test se<&en!e &nti 0e.ve 0ritten the 0hoe
program, an% !he!- that 0e get the right ans0er for the 4$ !ontent :0e !an easiy
see by gan!ing at the test se<&en!e that the 4$ !ontent is 0.C=. 7ere goes 5 0e.
&se the a%% an% %ivi%e symbos from the e1er!ise hint:
test/dna B ,!+(),
length B len(test/dna&
a/count B test/dna.count(3!3&
t/count B test/dna.count(3+3&
at/content B a/count C t/count / length
print(,!+ content is , C str(at/content&&
$he o&tp&t from this program oo-s i-e this:
!+ content is ".2$
$hat %oesn.t oo- right. +oo-ing ba!- at the !o%e 0e !an see 0hat has gone 0rong 5
in the !a!&ation, the %ivision has ta-en pre!e%en!e over the a%%ition, so 0hat 0e
have a!t&ay !a!&ate% is:
A+
T
length
B3 Chapter 2: #rinting an% manip&ating te1t
$o fi1 it, a 0e nee% to %o is a%% some parentheses aro&n% the a%%ition, so that the
ine be!omes:
at/content B (a/count C t/count& / length
;o0 0e get the !orre!t o&tp&t for the test se<&en!e:
!+ content is #.$
an% 0e !an go ahea% an% r&n the program &sing the onger se<&en!e, !onfi%ent
that the !o%e is 0or-ing an% that the !a!&ations are !orre!t. 7ere.s the fina
version:
my/dna B ,!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+,
length B len(my/dna&
a/count B my/dna.count(3!3&
t/count B my/dna.count(3+3&
at/content B (a/count C t/count& / length
print(,!+ content is , C str(at/content&&
an% the fina o&tp&t:
!+ content is #.%G$"G$"G$"G$"G$2
Complementing (/5
$his one seems pretty straightfor0ar% 5 0e nee% to ta-e o&r se<&en!e an% repa!e
4 0ith $, $ 0ith 4, C 0ith R, an% R 0ith C. ,e. have to ma-e fo&r separate !as to
replace, an% &se the ret&rn va&e for ea!h on as the inp&t for the ne1t tone. +et.s
try it:
BB Chapter 2: #rinting an% manip&ating te1t
my/dna B ,!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+,
4 replace ! 2ith +
replacement" B my/dna.replace(3!3 3+3&
4 replace + 2ith !
replacement2 B replacement".replace(3+3 3!3&
4 replace ) 2ith (
replacement3 B replacement2.replace(3)3 3(3&
4 replace ( 2ith )
replacement4 B replacement3.replace(3(3 3)3&
4 print the result of the final replacement
print(replacement4&
,hen 0e ta-e a oo- at the o&tp&t, ho0ever, something seems 0rong:
!)!)!!))!!!!))!!!!)!!!!!))!!!)!!!)!!!!!!!!))!!)))!!)!!
,e !an see 9&st by oo-ing at the origina se<&en!e that the first etter is 4, so the
first etter of the printe% se<&en!e sho&% be its !ompement, $. (&t instea% the
first etter is 4. 'n fa!t, a of the bases in the printe% se<&en!e are either 4 or $.
$his is %efinitey not 0hat 0e 0ant8
+et.s try an% tra!- the probem %o0n by printing o&t a the interme%iate steps as
0e:
my/dna B ,!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+,
replacement" B my/dna.replace(3!3 3+3&
print(replacement"&
replacement2 B replacement".replace(3+3 3!3&
print(replacement2&
replacement3 B replacement2.replace(3)3 3(3&
print(replacement3&
replacement4 B replacement3.replace(3(3 3)3&
print(replacement4&
$he o&tp&t from this program ma-es it !ear 0hat the probem is:
BC Chapter 2: #rinting an% manip&ating te1t
+)+(++)(++++)(++++(+++++()+++)+++)++++++++)(++()(++)++
!)!(!!)(!!!!)(!!!!(!!!!!()!!!)!!!)!!!!!!!!)(!!()(!!)!!
!(!(!!((!!!!((!!!!(!!!!!((!!!(!!!(!!!!!!!!((!!(((!!(!!
!)!)!!))!!!!))!!!!)!!!!!))!!!)!!!)!!!!!!!!))!!)))!!)!!
$he first repa!ement :the res&t of 0hi!h is sho0n in the first ine of the o&tp&t=
0or-s fine 5 a the 4s have been repa!e% 0ith $s :for e1ampe, oo- at the first
!hara!ter 5 it.s 4 in the origina se<&en!e an% $ in the first ine of the o&tp&t=.
$he se!on% repa!ement is 0here it starts to go 0rong: a the $s are repa!e% by 4s,
including t,ose t,at were t,ere as a result of t,e first replacement. "o %&ring
the first t0o repa!ements, the first !hara!ter is !hange% from 4 to $ an% then
straight ba!- to 4 again.
7o0 are 0e going to get ro&n% this probemE ?ne option is to pi!- a temporary
aphabet of fo&r etters an% %o ea!h repa!ement t0i!e:
my/dna B ,!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+,
replacement" B my/dna.replace(3!3 313&
replacement2 B replacement".replace(3+3 3H3&
replacement3 B replacement2.replace(3)3 3I3&
replacement4 B replacement3.replace(3(3 3?3&
replacement$ B replacement4.replace(313 3+3&
replacement% B replacement$.replace(3H3 3!3&
replacement7 B replacement%.replace(3I3 3(3&
replacementG B replacement7.replace(3?3 3)3&
print(replacementG&
$his gets &s the res&t 0e are oo-ing for. 't avoi%s the probem 0ith the previo&s
program by &sing another etter to stan% in for ea!h base 0hie the repa!ements
are being %one. For e1ampe, 4 is first !onverte% to 7 an% then ater on 7 is
!onverte% to $.
7ere.s a sighty more eegant 0ay of %oing it. ,e !an ta-e a%vantage of the fa!t
that the replace metho% is !ase/sensitive, an% ma-e a the repa!e% bases o0er
BF Chapter 2: #rinting an% manip&ating te1t
!ase. $hen, on!e a the repa!ements have been !arrie% o&t, 0e !an simpy !a
upper an% !hange the 0hoe se<&en!e ba!- to &pper !ase. +et.s ta-e a oo- at ho0
this 0or-s:
my/dna B ,!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+,
replacement" B my/dna.replace(3!3 3t3&
print(replacement"&
replacement2 B replacement".replace(3+3 3a3&
print(replacement2&
replacement3 B replacement2.replace(3)3 3g3&
print(replacement3&
replacement4 B replacement3.replace(3(3 3c3&
print(replacement4&
print(replacement4.upper(&&
$he o&tp&t ets &s see e1a!ty 0hat.s happening 5 noti!e that in this version of the
program 0e print the fina string t0i!e, on!e as it is an% then on!e !onverte% to
&pper !ase:
t)+(t+)(t++t)(+t+t(+t+++()+t+)t+t)t+t+t+t+)(t+()(++)t+
t)a(ta)(taat)(atat(ataaa()ata)tat)tatatata)(ta()(aa)ta
tga(tag(taatg(atat(ataaa(gatagtatgtatatatag(ta(g(aagta
tgactagctaatgcatatcataaacgatagtatgtatatatagctacgcaagta
+(!)+!()+!!+()!+!+)!+!!!)(!+!(+!+(+!+!+!+!()+!)()!!(+!
,e !an see that as the program r&ns, ea!h base in t&rn is repa!e% by its
!ompement in o0er !ase. "in!e the ne1t repa!ement is ony oo-ing for &pper
!ase !hara!ters, bases %on.t get !hange% ba!- as they %i% in the first version of o&r
program.
BH Chapter 2: #rinting an% manip&ating te1t
Restriction fragment lengt,s
+et.s start this e1er!ise by soving the probem man&ay. 'f 0e oo- thro&gh the
D;4 se<&en!e 0e !an spot the )!o2' site at position 21. 7ere.s the se<&en!e 0ith
the base positions abee% above an% the )!o2' motif in bo%:
" 2 3 4 $
#"234$%7GJ#"234$%7GJ#"234$%7GJ#"234$%7GJ#"234$%7GJ#"234
!)+(!+)(!++!)(+!+!(+!GAATTCT!+)!+!)!+!+!+!+)(!+()(++)!+
"in!e the )!o2' enGyme !&ts the D;4 bet0een the R an% first 4, 0e !an fig&re o&t
that the first fragment 0i r&n from position 0 to position 21, an% the se!on%
fragment from position 22 to the ast position, CB. $herefore the engths of the t0o
fragments are 22 an% 33.
,riting a program to fig&re o&t the engths is 9&st a <&estion of appying the same
ogi!. ,e. &se the find metho% to fig&re o&t the position of the start of the )!o2'
motif, then a%% one to a!!o&nt for the fa!t that the positions start !o&nting from
Gero 5 this 0i give &s the ength of the first fragment. From there 0e !an get the
ength of the se!on% fragment by fin%ing the ength of the inp&t se<&en!e an%
s&btra!ting the ength of the first fragment:
my/dna B ,!)+(!+)(!++!)(+!+!(+!(!!++)+!+)!+!)!+!+!+!+)(!+()(++)!+,
frag"/length B my/dna.find(,(!!++),& C "
frag2/length B len(my/dna& F frag"/length
print(,length of fragment one is , C str(frag"/length&&
print(,length of fragment t2o is , C str(frag2/length&&
$he o&tp&t from this program !onfirms that it agrees 0ith the ans0er 0e got
man&ay:
length of fragment one is 22
length of fragment t2o is 33
BI Chapter 2: #rinting an% manip&ating te1t
'f 0e 0ante% to r&n the same program &sing a %ifferent restri!tion enGyme, 0e.%
have to !hange bot, the string that 0e &se% in the find metho% !a, and the
n&mber that 0e a%% in or%er to ta-e a!!o&nt of the !&t site.
't.s 0orth noting that this program ass&mes that the D;4 se<&en!e %efinitey %oes
!ontain the restri!tion site 0e.re oo-ing for. 'f 0e try the same program &sing a
D;4 se<&en!e 0hi!h %oesn.t !ontain the site, it 0i report a fragment of ength 0
an% a fragment 0hose ength is e<&a to the tota ength of the D;4 se<&en!e.
,hie this is not stri!ty 0rong, it.s a itte misea%ing 5 if 0e 0ere going to &se this
program for rea/ife 0or-, 0e.% probaby prefer to have sighty %ifferent behavio&r
%epen%ing on 0hether or not the D;4 se<&en!e !ontaine% the motif 0e.re oo-ing
for. ,e. ta- abo&t ho0 to impement that type of behavio&r in !hapter F.
3plicing out introns* part one
'n this e1er!ise, 0e.re being as-e% to pro%&!e a program that %oes the 9ob of a
spi!eosome 5 spits a D;4 se<&en!e at t0o spe!ifie% o!ations to ma-e three
pie!es, then 9oin the o&ter t0o pie!es together
1
.
+et.s start by spitting the se<&en!e &p into three bits. ,e. have to &se the
s&bstring notation from earier in the !hapter, an% 0e. nee% to ta-e !are 0ith the
n&mbers. ,e -no0 that if 0e give a stop position for a s&bstring then it 0i go on
to the en% of the inp&t string, so rather than fig&re o&t the position of the en% of
the se<&en!e, 0e. 9&st be aGy an% &se a big n&mber. 7ere.s the !o%e :the first ine,
0here 0e store the D;4 se<&en!e in the variabe my_dna, is too ong to fit on one
ine on the page, so it oo-s i-e it.s sprea% o&t over m&tipe ines=:
1 ,e -no0 that that.s not reay ho0 a spi!osome 0or-s, b&t it.s fine as a !on!ept&a mo%e.
B@ Chapter 2: #rinting an% manip&ating te1t
my/dna B
,!+)(!+)(!+)(!+)(!)+(!)+!(+)!+!()+!+()!+(+!()+!)+)(!+)(!+)(!+)(!+)(!+)(!+)
(!+)(!+)(!+)!+()+!+)!+)(!+)(!+!+)(!+()!+)(!)+!)+!+,
exon" B my/dna'":%3*
exon2 B my/dna'J":"####*
print(exon" C exon2&
$he o&tp&t from this !o%e oo-s vag&ey right:
+)(!+)(!+)(!+)(!)+(!)+!(+)!+!()+!+()!+(+!()+!)+)(!+)(!+)(!+)(!+)!+)(!+)(!+
!+)(!+()!+)(!)+!)+!+
b&t 0hen 0e oo- more !osey 0e !an see that something is not right. $he printe%
!o%ing se<&en!e is s&ppose% to start at the very first !hara!ter of the inp&t
se<&en!e, b&t it.s starting at the se!on%. ,e have forgotten to ta-e into a!!o&nt the
fa!t that #ython starts !o&nting from Gero, so o&r n&mbers are a too high by one.
+et.s try again:
my/dna B
,!+)(!+)(!+)(!+)(!)+(!)+!(+)!+!()+!+()!+(+!()+!)+)(!+)(!+)(!+)(!+)(!+)(!+)
(!+)(!+)(!+)!+()+!+)!+)(!+)(!+!+)(!+()!+)(!)+!)+!+,
exon" B my/dna'#:%3*
exon2 B my/dna'J#:"####*
print(exon" C exon2&
;o0 the o&tp&t oo-s !orre!t 5 the !o%ing se<&en!e starts at the very beginning of
the inp&t se<&en!e:
!+)(!+)(!+)(!+)(!)+(!)+!(+)!+!()+!+()!+(+!()+!)+)(!+)(!+)(!+)(!+)!+)(!+)(!
+!+)(!+()!+)(!)+!)+!+
C0 Chapter 2: #rinting an% manip&ating te1t
3plicing out introns* part two
$his is a straightfor0ar% pie!e of n&mber/!r&n!hing. $here are a !o&pe of 0ays to
go abo&t it. ,e !o&% &se the e1on start/stop !oor%inates to !a!&ate the ength of
the !o%ing portion of the se<&en!e. 7o0ever, sin!e 0e.ve area%y 0ritten the !o%e
to generate the !o%ing se<&en!e, 0e !an simpy !a!&ate the ength of it, an% then
%ivi%e by the ength of the inp&t se<&en!e:
my/dna B
,!+)(!+)(!+)(!+)(!)+(!)+!(+)!+!()+!+()!+(+!()+!)+)(!+)(!+)(!+)(!+)(!+)(!+)
(!+)(!+)(!+)!+()+!+)!+)(!+)(!+!+)(!+()!+)(!)+!)+!+,
exon" B my/dna'#:%3*
exon2 B my/dna'J#:"####*
coding/length B len(exon" C exon2&
total/length B len(my/dna&
print(coding/length / total/length&
$he o&tp&t sho0s that 0e.re neary right:
#.7723$7723$7723$G
,e have !a!&ate% the !o%ing proportion as a fra!tion, b&t the e1er!ise !ae% for a
per!entage. ,e !an easiy fi1 this by m&tipying by 100. ;oti!e that the symbo for
m&tipi!ation is not x, as yo& might thin-, b&t 4. $he fina !o%e:
my/dna B
,!+)(!+)(!+)(!+)(!)+(!)+!(+)!+!()+!+()!+(+!()+!)+)(!+)(!+)(!+)(!+)(!+)(!+)
(!+)(!+)(!+)!+()+!+)!+)(!+)(!+!+)(!+()!+)(!)+!)+!+,
exon" B my/dna'#:%3*
exon2 B my/dna'J#:"####*
coding/length B len(exon" C exon2&
total/length B len(my/dna&
print("## K coding/length / total/length&
gives the !orre!t o&tp&t:
C1 Chapter 2: #rinting an% manip&ating te1t
77.23$7723$7723$G
atho&gh 0e probaby %on.t reay re<&ire that n&mber of signifi!ant fig&res. 'n
!hapter C 0e 0i earn ho0 to format the o&tp&t ni!ey.
3plicing out introns* part t,ree
$his so&n%s <&ite tri!-y, b&t 0e have area%y %one the har% bit in part one. 4 0e
nee% to %o is e1tra!t the intron se<&en!e as 0e as the e1ons, !onvert it to o0er
!ase, then !on!atenate the three se<&en!es to re!reate the origina genomi!
se<&en!e:
my/dna B
,!+)(!+)(!+)(!+)(!)+(!)+!(+)!+!()+!+()!+(+!()+!)+)(!+)(!+)(!+)(!+)(!+)(!+)
(!+)(!+)(!+)!+()+!+)!+)(!+)(!+!+)(!+()!+)(!)+!)+!+,
exon" B my/dna'#:%3*
intron B my/dna'%3:J#*
exon2 B my/dna'J#:"####*
print(exon" C intron.lo2er(& C exon2&
+oo-ing at the o&tp&t, 0e see an &pper !ase D;4 se<&en!e 0ith a o0er !ase
se!tion in the mi%%e, as e1pe!te%:
!+)(!+)(!+)(!+)(!)+(!)+!(+)!+!()+!+()!+(+!()+!)+)(!+)(!+)(!+)(atcgatcgatcg
atcgatcgatcatgct!+)!+)(!+)(!+!+)(!+()!+)(!)+!)+!+
,hen 0e are appying severa transformations to te1t, as in this e1er!ise, there are
&s&ay a n&mber of %ifferent 0ays 0e !an 0rite the program. For e1ampe, 0e
!o&% store the o0er !ase version of the intron, rather than !onverting it to o0er
!ase 0hen printing:
intron B my/dna'%3:J#*.lo2er(&
C2 Chapter 2: #rinting an% manip&ating te1t
?r 0e !o&% avoi% &sing variabes for the introns an% e1ons a together, an% %o
everything in one big print statement:
print(my/dna'#:%3* C my/dna'%3:J#*.lo2er(& C my/dna'J#:"####*&
$his ast option is very !on!ise, b&t a bit har%er to rea% than the more verbose 0ay.
4s the e1er!ises in this boo- get onger, yo&. noti!e that there are more an% more
%ifferent 0ays to 0rite the !o%e 5 yo& may en% &p 0ith so&tions that oo- very
%ifferent to the e1ampe so&tions. ,hen trying to !hoose bet0een %ifferent 0ays
to 0rite a program, a0ays favo&r the so&tion that is !earest in intent an% easiest
to rea%.
C3 Chapter 3: 2ea%ing an% 0riting fies
3: Reading and writing files
Why are we so intereste" in working with files?
4s 0e start this !hapter, 0e fin% o&rseves on!e again %oing things in a sighty
%ifferent or%er to most programming boo-s. $he ma9ority of intro%&!tory
programming boo-s 0on.t !onsi%er 0or-ing 0ith e1terna fies &nti m&!h f&rther
aong, so 0hy are 0e intro%&!ing it no0E
$he ans0er, as 0as the !ase in the ast !hapter, ies in the parti!&ar 9obs that 0e
0ant to &se #ython for. $he %ata that 0e as bioogists 0or- 0ith is store% in fies, so
if 0e.re going to 0rite &sef& programs 0e nee% a 0ay to get the %ata o&t of fies
an% into o&r programs :an% vice versa=. 4s yo& 0ere going thro&gh the e1er!ises in
the previo&s !hapter, it may have o!!&rre% to yo& that !opying an% pasting a D;4
se<&en!e %ire!ty into a program ea!h time 0e 0ant to &se it is not a very goo%
approa!h to ta-e, an% yo&.% be right. $he se<&en!es 0e 0ere 0or-ing 0ith in the
e1er!ises 0ere very shortD obvio&sy rea/ife %ata 0i be m&!h onger. 4so, it
seems ineegant to have the %ata 0e 0ant to 0or- on mi1e% &p 0ith the !o%e that
manip&ates it. 'n this !hapter 0e. see a better 0ay to %o it.
,e.re &!-y in bioogy that many of the types of %ata that 0e 0or- 0ith are store%
in te1t
1
fies 0hi!h are reativey simpe to pro!ess &sing #ython. Chief among
these, of !o&rse, are D;4 an% protein se<&en!e %ata, 0hi!h !an be store% in a
variety of formats.
2
(&t there are many other types of %ata 5 se<&en!ing rea%s,
<&aity s!ores, ";#s, phyogeneti! trees, rea% maps, geographi!a sampe %ata,
geneti! %istan!e matri!es 5 0hi!h 0e !an a!!ess from 0ithin o&r #ython programs.
1 i.e. fies 0hi!h yo& !an open in a te1t e%itor an% rea%, as oppose% to binary fies 0hi!h !annot be rea%
%ire!ty.
2 'n this boo- 0e. mosty be ta-ing abo&t F4"$4 format as it.s the simpest an% most !ommon format, b&t
there are many more.
CB Chapter 3: 2ea%ing an% 0riting fies
4nother reason for o&r interest in fie inp&t/o&tp&t is the nee% for o&r #ython
programs to 0or- as part of a pipeine or 0or- fo0 invoving other, e1isting toos.
,hen it !omes to &sing #ython in the rea 0or%, 0e often 0ant #ython to either
a!!ept %ata from, or provi%e %ata to, another program. ?ften the easiest 0ay to %o
this is to have #ython rea%, or 0rite, fies in a format that the other program
area%y &n%erstan%s.
(ea"ing te!t from a file
Firsty, a <&i!- note abo&t 0hat 0e mean by te1t. 'n programming, 0hen 0e ta-
abo&t te!t files, 0e are not ne!essariy ta-ing abo&t something that is h&man/
rea%abe. 2ather, 0e are ta-ing abo&t a fie that !ontains !hara!ters an% ines 5
something that yo& !o&% open &p an% vie0 in a te1t e%itor, regar%ess of 0hether
yo& !o&% a!t&ay ma-e sense of the fie or not. )1ampes of te1t fies 0hi!h yo&
might have en!o&ntere% in!&%e:
• F4"$4 fies of D;4 or protein se<&en!es
• fies !ontaining o&tp&t from !omman%/ine programs :e.g. (+4"$=
• F4"$L fies !ontaining D;4 se<&en!ing rea%s
• 7$M+ fies
• 0or% pro!essing %o!&ments
• an% of !o&rse, #ython !o%e
'n !ontrast, most fies that yo& en!o&nter %ay/to/%ay 0i be binary files 5 ones
0hi!h are not ma%e &p of !hara!ters an% ines, b&t of bytes. )1ampes in!&%e:
• image fies :J#)Rs an% #;Rs=
• a&%io fies
• vi%eo fies
CC Chapter 3: 2ea%ing an% 0riting fies
• !ompresse% fies :e.g. X'# fies=
'f yo&.re not s&re 0hether a parti!&ar fie is te1t or binary, there.s a very simpe
0ay to te 5 9&st open it &p in a te1t e%itor. 'f the fie %ispays 0itho&t any probem,
then it.s te1t :regar%ess of 0hether yo& !an ma-e sense of it or not=. 'f yo& get an
error or a 0arning from yo&r te1t e%itor, or the fie %ispays as a !oe!tion of
in%e!ipherabe !hara!ters, then it.s binary.
$he e1ampes an% e1er!ises in this !hapter are a itte %ifferent from those in the
previo&s one, be!a&se they rey on the e1isten!e of the fies that 0e are going to
manip&ate. 'f yo& 0ant to try r&nning the e1ampes in this !hapter, yo&. nee% to
ma-e s&re that there is a fie in yo&r 0or-ing %ire!tory !ae% "na=t!t 0hi!h has a
singe ine !ontaining a D;4 se<&en!e. $he easiest 0ay to %o this is to r&n the
e1ampes 0hie in the chapter?) fo%er insi%e the e1er!ises %o0noa%
1
.
6sing open to read a file
'n #ython, as in the physi!a 0or%, 0e have to open a fie before 0e !an rea% 0hat.s
insi%e it. $he #ython f&n!tion that !arries o&t the 9ob of opening a fie is very
sensiby !ae% open. 't ta-es one arg&ment 5 a string 0hi!h !ontains the name of
the fie 5 an% ret&rns a file ob@ectA
my/file B open(,dna.txt,&
4 file ob@ect is a ne0 type 0hi!h 0e haven.t en!o&ntere% before, an% it.s a itte
more !ompi!ate% than the string an% n&mber types that 0e sa0 in the previo&s
!hapter. ,ith strings an% n&mbers it 0as easy to &n%erstan% 0hat they represente%
5 a singe bit of te1t, or a singe n&mber. 4 fie ob9e!t, in !ontrast, represents
something a bit ess tangibe 5 it represents a fie on yo&r !omp&ter.s har% %rive.
1 'f yo& haven.t %o0noa%e% the e1ampe fies yet, yo&. fin% the in- in the emai that !ame 0ith yo&r
p&r!hase of this boo-.
CF Chapter 3: 2ea%ing an% 0riting fies
$he 0ay that 0e &se fie ob9e!ts is a bit %ifferent to strings an% n&mbers as 0e. 'f
yo& gan!e ba!- at the e1ampes from the previo&s !hapter yo&. see that most of
the time 0hen 0e 0ant to &se a variabe !ontaining a string or n&mber 0e 9&st &se
the variabe name:
my/string B 3abcdefg3
print(my/string&
my/number B 42
print(my/number C "&
'n !ontrast, 0hen 0e.re 0or-ing 0ith fie ob9e!ts most of o&r intera!tion 0i be
thro&gh metho"s. $his stye of programming 0i seen &n&s&a at first, b&t as 0e.
see in this !hapter, the fie type has a 0e tho&ght/o&t set of metho%s 0hi!h et &s
%o ots of &sef& things.
$he first thing 0e nee% to be abe to %o is to rea% the !ontents of the fie. $he fie
type has a read metho% 0hi!h %oes this. 't %oesn.t ta-e any arg&ments, an% the
ret&rn va&e is a string, 0hi!h 0e !an store in a variabe. ?n!e 0e.ve rea% the fie
!ontents into a variabe, 0e !an treat them 9&st i-e any other string 5 for e1ampe,
0e !an print them:
my/file B open(,dna.txt,&
file/contents B my/file.read(&
print(file/contents&
/iles0 contents an" file names
,hen earning to 0or- 0ith fies it.s very easy to get !onf&se% bet0een a file ob@ect,
a file name, an% the contents of a fie. $a-e a oo- at the foo0ing bit of !o%e:
CH Chapter 3: 2ea%ing an% 0riting fies
my/file/name B ,dna.txt,
my/file B open(my/file/name&
my/file/contents B my/file.read(&
,hat.s going on hereE ?n ine 1, 0e store the string "na=t!t in the variabe
my_file_name. ?n ine 2, 0e &se the variabe my_file_name as the arg&ment
to the open f&n!tion, an% store the res&ting fie ob9e!t in the variabe my_file.
?n ine 3, 0e !a the read metho% on the variabe my_file, an% store the
res&ting string in the variabe my_file_contents.
$he important thing to &n%erstan% abo&t this !o%e is that there are three separate
variabes 0hi!h have %ifferent types an% 0hi!h are storing three very %ifferent
things. my_file_name is a string, an% it stores the name of a fie on %is-.
my_file is a fie ob9e!t, an% it represents the fie itsef. my_file_contents is a
string, an% it stores the te1t that is in the fie.
2emember that variabe names are arbitrary 5 the !omp&ter %oesn.t !are 0hat yo&
!a yo&r variabes. "o this pie!e of !o%e is e1a!ty the same as the previo&s
e1ampe:
apple B ,dna.txt,
banana B open(apple&
grape B banana.read(&
e1!ept it is har%er to rea%8 'n !ontrast, the fie name :"na=t!t= is not arbitrary 5 it
m&st !orrespon% to the name of a fie on the har% %rive of yo&r !omp&ter.
4 !ommon error is to try to &se the read metho% on the 0rong thing. 2e!a that
read is a metho% that ony 0or-s on fie ob9e!ts. 'f 0e try to &se the read metho%
on the fie name:
my/file/name B ,dna.txt,
my/contents B my/file/name.read(&
1
2
3
CI Chapter 3: 2ea%ing an% 0riting fies
0e. get an !ttri$uteError 5 #ython 0i !ompain that strings %on.t have a
read metho%
1
:
!ttribute9rror: 3str3 obDect has no attribute 3read3
4nother !ommon error is to &se the file ob@ect 0hen 0e meant to &se the file
contents. 'f 0e try to print the fie ob9e!t:
my/file/name B ,dna.txt,
my/file B open(my/file/name&
print(my/file&
0e 0on.t get an error, b&t 0e. get an o%%/oo-ing ine of o&tp&t:
;open file 3dna.txt3 mode 3r3 at #x7fc$ff77G4b#-
,e 0on.t %is!&ss the meaning of this ine no0: 9&st remember that if yo& try to
print the !ontents of a fie b&t instea% yo& get some o&tp&t that oo-s i-e the
above, yo& have amost %efinitey printe% the fie ob9e!t rather than the fie
!ontents.
1ealing with newlines
+et.s ta-e a oo- at the o&tp&t 0e get 0hen 0e try to print some information from a
fie. ,e. &se the "na=t!t fie from the chapter?) e1er!ises fo%er. $his fie !ontains
a singe ine 0ith a short D;4 se<&en!e. ?pen the fie &p in a te1t e%itor an% ta-e a
oo- at it.
,e.re going to 0rite a simpe program to rea% the D;4 se<&en!e from the fie an%
print it o&t aong 0ith its ength. #&tting together the fie f&n!tions an% metho%s
1 From no0 on, '. 9&st sho0 the reevant bits of o&tp&t 0hen %is!&ssing error message.
C@ Chapter 3: 2ea%ing an% 0riting fies
from this !hapter, an% the materia 0e sa0 in the previo&s !hapter, 0e get the
foo0ing !o%e:
4 open the file
my/file B open(,dna.txt,&
4 read the contents
my/dna B my/file.read(&
4 calculate the length
dna/length B len(my/dna&
4 print the output
print(,seAuence is , C my/dna C , and length is , C str(dna/length&&
,hen 0e oo- at the o&tp&t, 0e !an see that the program is 0or-ing amost
perfe!ty 5 b&t there is something strange: the o&tp&t has been spit over t0o ines:
seAuence is !)+(+!)(+()!)+(!+)
and length is "J
$he e1panation is simpe on!e yo& -no0 it: #ython has in!&%e% the ne0 ine
!hara!ter at the en% of the "na=t!t fie as part of the !ontents. 'n other 0or%s, the
variabe my_dna has a ne0 ine !hara!ter at the en% of it. 'f 0e !o&% vie0 the
my_dna variabe %ire!ty
1
, 0e 0o&% see that it oo-s i-e this:
3!)+(+!)(+()!)+(!+).n3
$he so&tion is aso simpe. (e!a&se this is s&!h a !ommon probem, strings have a
metho% for removing ne0 ines from the en% of them. $he metho% is !ae%
rstrip, an% it ta-es one string arg&ment 0hi!h is the !hara!ter that yo& 0ant to
remove. 'n this !ase, 0e 0ant to remove the ne0ine !hara!ter :\n=. 7ere.s a
mo%ifie% version of the !o%e 5 note that the arg&ment to rstrip is itsef a string
so nee%s to be en!ose% in <&otes:
1 'n fa!t, 0e !an %o this 5 there.s a f&n!tion !ae% repr that ret&rns a representation of a variabe.
F0 Chapter 3: 2ea%ing an% 0riting fies
my/file B open(,dna.txt,&
my/file/contents B my/file.read(&
4 remo0e the ne2line from the end of the file contents
my/dna B my/file/contents.rstrip(,.n,&
dna/length B len(my/dna&
print(,seAuence is , C my/dna C , and length is , C str(dna/length&&
an% no0 the o&tp&t oo-s 9&st as 0e e1pe!te%:
seAuence is !)+(+!)(+()!)+(!+) and length is "G
'n the !o%e above, 0e first rea% the fie !ontents an% then remove% the ne0ine, in
t0o separate steps:
my/file/contents B my/file.read(&
my/dna B my/file/contents.rstrip(,.n,&
b&t it.s more !ommon to rea% the !ontents an% remove the ne0ine a in one go,
i-e this:
my/dna B my/file.read(&.rstrip(,.n,&
$his is a bit tri!-y to rea% at first as 0e are &sing t0o %ifferent metho%s :read an%
rstrip= in the same statement. $he -ey is to rea% it from eft to right 5 0e ta-e the
my_file variabe an% &se the read metho% on it, then 0e ta-e the o&tp&t of that
metho% :0hi!h 0e -no0 is a string= an% &se the rstrip metho% on it. $he res&t
of the rstrip metho% is then store% in the my_dna variabe.
'f yo& fin% it %iffi!&t 0rite the 0hoe thing as one statement i-e this, 9&st brea- it
&p an% %o the t0o things separatey 5 yo&r programs 0i r&n 9&st as 0e.
F1 Chapter 3: 2ea%ing an% 0riting fies
2issing files
,hat happens if 0e try to rea% a fie that %oesn.t e1istE
my/file B open(,nonexistent.txt,&
,e get a ne0 type of error that 0e.ve not seen before:
L>9rror: '9rrno 2* <o such file or directory: 3nonexistent.txt3
'%eay, 0e.% i-e to be abe to !he!- if a fie e1ists before 0e try to open it 5 0e.
see ho0 to %o that in !hapter @. 4ternativey, 0e !an %ea 0ith missing fie errors
0hen they o!!&r: there are many e1ampes in the !hapter on e1!eptions in
A"vance" Python for 8iologists.
Writing te!t to files
4 the e1ampe programs that 0e.ve seen so far in this boo- have pro%&!e% o&tp&t
straight to the s!reen. $hat.s great for e1poring ne0 feat&res an% 0hen 0or-ing on
programs, be!a&se it ao0s yo& to see the effe!t of !hanges to the !o%e right a0ay.
't has a fe0 %ra0ba!-s, ho0ever, 0hen 0riting !o%e that 0e might 0ant to &se in
rea ife.
#rinting o&tp&t to the s!reen ony reay 0or-s 0e 0hen there isn.t very m&!h of
it
1
. 't.s great for short programs an% stat&s messages, b&t <&i!-y be!omes
!&mbersome for arge amo&nts of o&tp&t. "ome terminas str&gge 0ith arge
amo&nts of te1t, or 0orse, have a imite% s!roba!- !apabiity 0hi!h !an !a&se the
first bit of yo&r o&tp&t to %isappear. 't.s not easy to sear!h in o&tp&t that.s being
%ispaye% at the termina, an% ong ines ten% to get 0rappe%. 4so, for many
1 +in&1 &sers may be a0are that 0e !an re%ire!t termina o&tp&t to a fie &sing she re%ire!tion, 0hi!h !an
get aro&n% some of these probems.
F2 Chapter 3: 2ea%ing an% 0riting fies
programs 0e 0ant to sen% %ifferent bits of o&tp&t to %ifferent fies, rather than
having it a %&mpe% in the same pa!e.
Most importanty, termina o&tp&t vanishes 0hen yo& !ose yo&r termina program.
For sma programs i-e the e1ampes in this boo-, that.s not a probem 5 if yo&
0ant to see the o&tp&t again yo& !an 9&st re/r&n the program. 'f yo& have a
program that re<&ires a fe0 ho&rs to r&n, that.s not s&!h a great option.
7pening files for writing
'n the previo&s se!tion, 0e sa0 ho0 to open a fie an% rea% its !ontents. ,e !an
aso open a fie an% write some %ata to it, b&t 0e have to &se the open f&n!tion in a
sighty %ifferent 0ay. $o open a fie for 0riting, 0e &se a t0o/arg&ment version of
the open f&n!tion, 0here the se!on% arg&ment is a spe!iay/formatte% string
%es!ribing 0hat 0e 0ant to %o to the fie
1
. $his se!on% arg&ment !an be >r> for
rea%ing, >0> for 0riting, or >a> for appen%ing
2
. 'f 0e eave o&t the se!on% arg&ment
:i-e 0e %i% for a the e1ampes above=, #ython &ses the %efa&t, 0hi!h is >r> for
rea%ing.
$he %ifferen!e bet0een >0> an% >a> is s&bte, b&t important. 'f 0e open a fie that
area%y e1ists &sing the mo%e >0>, then 0e 0i over0rite the !&rrent !ontents 0ith
0hatever %ata 0e 0rite to it. 'f 0e open an e1isting fie 0ith the mo%e >a>, it 0i
a%% ne0 %ata onto the en% of the fie, b&t 0i not remove any e1isting !ontent. 'f
there %oesn.t area%y e1ist a fie 0ith the spe!ifie% name, then >0> an% >a> behave
i%enti!ay 5 they 0i both !reate a ne0 fie to ho% the o&tp&t.
L&ite a ot of #ython f&n!tions an% metho%s have these optiona arg&ments. For
the p&rposes of this boo-, 0e 0i ony mention them 0hen they are %ire!ty
reevant to 0hat 0e.re %oing. 'f yo& 0ant to see a the optiona arg&ments for a
1 ,e !a this the mo"e of the fie.
2 $hese are the most !ommony/&se% options 5 there are a fe0 others.
F3 Chapter 3: 2ea%ing an% 0riting fies
parti!&ar metho% or f&n!tion, the best pa!e to oo- is the offi!ia #ython
%o!&mentation 5 see !hapter 1 for %etais.
?n!e 0e.ve opene% a fie for 0riting, 0e !an &se the fie -rite metho% to 0rite
some te1t to it. -rite 0or-s a ot i-e print 5 it ta-es a singe string arg&ment /
b&t instea% of printing the string to the s!reen it 0rites it to the fie.
7ere.s ho0 0e &se open 0ith a se!on% arg&ment to open a fie an% 0rite a singe
ine of te1t to it:
my/file B open(,out.txt, ,2,&
my/file.2rite(,1ello 2orld,&
(e!a&se the o&tp&t is being 0ritten to the fie in this e1ampe, yo& 0on.t see any
o&tp&t on the s!reen if yo& r&n it. $o !he!- that the !o%e has 0or-e%, yo&. have to
r&n it, then open &p the fie out=t!t in yo&r te1t e%itor an% !he!- that its !ontents
are 0hat yo& e1pe!t
1
.
2emember that 0ith -rite, 9&st i-e 0ith print, 0e !an &se an+ string as the
arg&ment. $his aso means that 0e !an &se any metho% or f&n!tion that returns a
string. $he foo0ing are a perfe!ty ?K:
4 2rite ,abcdef,
my/file.2rite(,abc, C ,def,&
4 2rite ,G,
my/file.2rite(str(len(3!(+()+!(3&&&
4 2rite ,++(),
my/file.2rite(,!+(),.replace(3!3 3+3&&
4 2rite ,atgc,
my/file.2rite(,!+(),.lo2er(&&
4 2rite contents of my/0ariable
my/file.2rite(my/0ariable&
1 .t1t is the stan%ar% fie name e1tension for a pain te1t fie. +ater in this boo-, 0hen 0e generate o&tp&t
fies 0ith a parti!&ar format, 0e. &se %ifferent fie name e1tensions.
FB Chapter 3: 2ea%ing an% 0riting fies
4losing files
$here.s one more important fie metho% to oo- at before 0e finish this !hapter 5
close. *ns&rprisingy, this is the opposite of open :b&t note that it.s a metho",
0hereas open is a function=. ,e sho&% !a close after 0e.re %one rea%ing or
0riting to a fie 5 0e 0on.t go into the %etais here, b&t it.s a goo% habit to get into
as it avoi%s some types of b&gs that !an be tri!-y to tra!- %o0n
1
. close is an
&n&s&a metho% as it ta-es no arg&ments :so it.s !ae% 0ith an empty pair of
parentheses= an% %oesn.t ret&rn any &sef& va&e:
my/file B open(,out.txt, ,2,&
my/file.2rite(,1ello 2orld,&
4 remember to close the file
my/file.close(&
$here.s aso 0ay of &sing a too !ae% a !onte1t manager to ta-e !are of !osing
fies a&tomati!ay: see the e1!eptions !hapter in A"vance" Python for 8iologists for
more %etais.
Paths an" fol"ers
"o far, 0e have ony %eat 0ith opening fies in the same fo%er that 0e are r&nning
o&r program. ,hat if 0e 0ant to open a fie from a %ifferent part of the fie systemE
$he open f&n!tion is <&ite happy to %ea 0ith fies from any0here on yo&r
!omp&ter, as ong as yo& give the f& path :i.e. the se<&en!e of fo%er names that
tes yo& the o!ation of the fie=. J&st give a file path as the arg&ment rather than a
file name. $he format of the fie path oo-s %ifferent %epen%ing on yo&r operating
system. 'f yo&.re on +in&1, it 0i oo- i-e this:
1 "pe!ifi!ay, it heps to ens&re that o&tp&t to a fie is f&she%, 0hi!h is ne!essary 0hen 0e 0ant to ma-e a
fie avaiabe to another program as part of o&r 0or- fo0.
FC Chapter 3: 2ea%ing an% 0riting fies
my/file B open(,/home/martin/myfolder/myfile.txt,&
if yo&.re on ,in%o0s, i-e this
2
:
my/file B open(r,c:.2indo2s.@es:top.myfolder.myfile.txt,&
an% if yo&.re on a Ma!, i-e this:
my/file B open(,/Msers/martin/@es:top/myfolder/myfile.txt,&
(ecap
,e.ve ta-en a 0hoe !hapter to intro%&!e the vario&s 0ays of rea%ing an% 0riting
to fies, be!a&se it.s s&!h an important part of b&i%ing programs that are &sef& in
bioogy. ,e.ve seen ho0 0or-ing 0ith fie !ontents is a0ays a t0o/step pro!ess 5
0e m&st open a fie before rea%ing or 0riting 5 an% oo-e% at severa !ommon
pitfas. ,e. ret&rn to the theme of fie manip&ation in ater !hapters 0here 0e.
a%%ress some of the short!omings of the te!hni<&es 0e earne% in this !hapter:
• 4 the e1ampes in this !hapter than invove rea%ing fies %o so by rea%ing
a the !ontent into a singe variabe. ?ften, this is not 0hat 0e 0ant 5 a
m&!h more !ommon re<&irement is to pro!ess a fie ine/by/ine. 'n !hapter
B 0e. earn abo&t ists an% oops, 0hi!h 0i ao0 &s to %o e1a!ty that.
• $here are, of !o&rse, many things 0e 0ant to %o to fies besi%es simpy
rea%ing their !ontents. ,e 0o&% aso i-e o&r programs to be abe to move
an% !opy fies, to !reate an% %eete fies an% %ire!tories, an% to ist fies an%
their properties. ,e. !over the toos re<&ire% to %o these things in !hapter
@.
2 $he e1tra r !hara!ter before the string is ne!essary to prevent #ython from trying to interpret the
ba!-sash in the fie pathD see !hapter H for an e1panation.
FF Chapter 3: 2ea%ing an% 0riting fies
• Finay, another feat&re !ommon to a o&r e1ampes is that the names of
fies are 0ritten as part of the !o%e. ,e 0i generay 0ant o&r rea/ife
programs to be more fe1ibe, an% !apabe of rea%ing fies that are spe!ifie%
by the &ser. Chapter @ aso %eas 0ith the vario&s forms of &ser inp&t an% in
it 0e. earn ho0 to ma-e o&r programs a!!ept fie names fe1iby.
FH Chapter 3: 2ea%ing an% 0riting fies
!ercises
3plitting genomic (/5
+oo- in the chapter?) fo%er for a fie !ae% genomic?"na=t!t 5 it !ontains the same
pie!e of genomi! D;4 that 0e 0ere &sing in the fina e1er!ise from !hapter 2. ,rite
a program that 0i spit the genomi! D;4 into !o%ing an% non/!o%ing parts, an%
0rite these se<&en!es to t0o separate fies.
8int: &se yo&r so&tion to the ast e1er!ise from !hapter 2 as a starting point.
$riting a )53T5 file
F4"$4 fie format is a !ommony/&se% D;4 an% protein se<&en!e fie format. 4
singe se<&en!e in F4"$4 format oo-s i-e this:
-seAuence/name
!+)(!)+(!+)(!+)(+!)(!+
,here se<&en!eYname is a hea%er that %es!ribes the se<&en!e :the greater/than
symbo in%i!ates the start of the hea%er ine=. ?ften, the hea%er !ontains an
a!!ession n&mber that reates to the re!or% for the se<&en!e in a p&bi! se<&en!e
%atabase. 4 singe F4"$4 fie !an !ontain m&tipe se<&en!es, i-e this:
-seAuence/one
!+)(!+)(!+)(!+)(!+
-seAuence/t2o
!)+!()+!()+!()!+)(
-seAuence/three
!)+()!+)(!+)(+!))+
FI Chapter 3: 2ea%ing an% 0riting fies
,rite a program that 0i !reate a F4"$4 fie for the foo0ing three se<&en!es 5
ma-e s&re that a se<&en!es are in &pper !ase an% ony !ontain the bases 4, $, R
an% C.
3e2uence ,eader (/5 se2uence
!5C36/ !"C#"!C#!"C#!"C#!"C#C"!#!C#"!"C#
(E7819 actgatcgacgatcgatcgatcacgact
:;<=>? !C"#!C-!C"#"--!C"#"!----C!"#"#
$riting multiple )53T5 files
*se the %ata from the previo&s e1er!ise, b&t instea% of !reating a singe F4"$4 fie,
!reate three ne0 F4"$4 fies 5 one per se<&en!e. $he names of the F4"$4 fies
sho&% be the same as the se<&en!e hea%er names, 0ith the e1tension =fasta.
F@ Chapter 3: 2ea%ing an% 0riting fies
&olutions
3plitting genomic (/5
,e have a hea%/start on this probem, be!a&se 0e have area%y ta!-e% a simiar
probem in the previo&s !hapter. +et.s remin% o&rseves of the so&tion 0e en%e%
&p 0ith for that e1er!ise:
my/dna B
,!+)(!+)(!+)(!+)(!)+(!)+!(+)!+!()+!+()!+(+!()+!)+)(!+)(!+)(!+)(!+)(!+)(!+)
(!+)(!+)(!+)!+()+!+)!+)(!+)(!+!+)(!+()!+)(!)+!)+!+,
exon" B my/dna'#:%2*
intron B my/dna'%2:J#*
exon2 B my/dna'J#:"####*
print(exon" C intron.lo2er(& C exon2&
,hat !hanges %o 0e nee% to ma-eE Firsty, 0e nee% to rea% the D;4 se<&en!e from
a fie instea% of 0riting it in the !o%e:
dna/file B open(,genomic/dna.txt,&
my/dna B dna/file.read(&
"e!on%y, 0e nee% to !reate t0o ne0 fie ob9e!ts to ho% the o&tp&t:
coding/file B open(,coding/dna.txt, ,2,&
noncoding/file B open(,noncoding/dna.txt, ,2,&
Finay, 0e nee% to !on!atenate the t0o e1on se<&en!es an% 0rite them to the
!o%ing D;4 fie, an% 0rite the intron se<&en!e to the non/!o%ing D;4 fie:
coding/file.2rite(exon" C exon2&
noncoding/file.2rite(intron&
H0 Chapter 3: 2ea%ing an% 0riting fies
+et.s p&t it a together, 0ith some ban- ines to separate o&t the %ifferent parts of
the program:
4 open the file and read its contents
dna/file B open(,genomic/dna.txt,&
my/dna B dna/file.read(&
4 extract the different bits of @<! seAuence
exon" B my/dna'#:%2*
intron B my/dna'%2:J#*
exon2 B my/dna'J#:"####*
4 open the t2o output files
coding/file B open(,coding/dna.txt, ,2,&
noncoding/file B open(,noncoding/dna.txt, ,2,&
4 2rite the seAuences to the output files
coding/file.2rite(exon" C exon2&
noncoding/file.2rite(intron&
$riting a )53T5 file
+et.s start this probem by thin-ing abo&t the variabes 0e.re going to nee%. ,e
have three D;4 se<&en!es in tota, so 0e. nee% three variabes to ho% the
se<&en!e hea%ers, an% three more to ho% the se<&en!es themseves:
header/" B ,!5C36/
header_6 @ (E7819
header_/ @ :;<=>?
seA_3 @ !"C#"!C#!"C#!"C#!"C#C"!#!C#"!"C#
seA_6 @ actgatcgacgatcgatcgatcacgact
seA_/ @ !C"#!C-!C"#"--!C"#"!----C!"#"#
F4"$4 format has aternating ines of hea%er an% se<&en!e, so before 0e try any
se<&en!e manip&ation, et.s try to 0rite a program that pro%&!es the ines in the
right or%er. 2ather than 0riting to a fie, 0e. print the o&tp&t to the s!reen for no0
H1 Chapter 3: 2ea%ing an% 0riting fies
5 that 0i ma-e it easier to see the o&tp&t right a0ay. ?n!e 0e.ve got it 0or-ing,
0e. s0it!h over to fie o&tp&t. 7ere.s a fe0 ines 0hi!h 0i print %ata to the
s!reen:
print(header/"&
print(seA/"&
print(header/2&
print(seA/2&
print(header/3&
print(seA/3&
an% here.s 0hat the o&tp&t oo-s i-e:
!N)"23
!+)(+!)(!+)(!+)(!+)()+!(!)(+!+)(
@974$%
actgatcgacgatcgatcgatcacgact
@974$%
actgatcgacgatcgatcgatcacgact
;ot far off 5 the ines are in the right or%er, b&t 0e forgot to in!&%e the greater/
than symbo at the start of the hea%er. 4so, 0e %on.t reay nee% to print the
hea%er an% the se<&en!e separatey for ea!h se<&en!e 5 0e !an in!&%e a ne0ine
!hara!ter in the print string in or%er to get them on separate ines. 7ere.s an
improve% version of the !o%e:
print(3-3 C header/" C 3.n3 C seA/"&
print(3-3 C header/2 C 3.n3 C seA/2&
print(3-3 C header/3 C 3.n3 C seA/3&
an% the o&tp&t oo-s better too:
H2 Chapter 3: 2ea%ing an% 0riting fies
-!N)"23
!+)(+!)(!+)(!+)(!+)()+!(!)(+!+)(
-@974$%
actgatcgacgatcgatcgatcacgact
-1LH7GJ
!)+(!)F!)+(+FF!)+(+!FFFF)!+(+(
;e1t, et.s ta!-e the probems 0ith the se<&en!es. $he se!on% se<&en!e is in o0er
!ase, an% it nee%s to be in &pper !ase 5 0e !an fi1 that &sing the upper string
metho%. $he thir% se<&en!e has a b&n!h of gaps that 0e nee% to remove. ,e
haven.t !ome a!ross a remove metho%.... b&t 0e %o -no0 ho0 to repa!e one
!hara!ter 0ith another. 'f 0e repa!e a the gap !hara!ters 0ith an empty string, it
0i be the same as removing them
1
. 7ere.s a version that fi1es both se<&en!es:
print(3-3 C header/" C 3.n3 C seA/"&
print(3-3 C header/2 C 3.n3 C seA/2.upper(&&
print(3-3 C header/3 C 3.n3 C seA/3.replace(3F3 33&&
;o0 the printe% o&tp&t is perfe!t:
-!N)"23
!+)(+!)(!+)(!+)(!+)()+!(!)(+!+)(
-@974$%
!)+(!+)(!)(!+)(!+)(!+)!)(!)+
-1LH7GJ
!)+(!)!)+(+!)+(+!)!+(+(
$he fina step is to s0it!h from printe% o&tp&t to 0riting to a fie. ,e. open a ne0
fie, an% !hange the three print ines to -rite:
1 4n empty string is 9&st a pair of open an% !ose <&otation mar-s 0ith nothing in bet0een them.
H3 Chapter 3: 2ea%ing an% 0riting fies
output B open(,seAuences.fasta, ,2,&
output.2rite(3-3 C header/" C 3.n3 C seA/"&
output.2rite(3-3 C header/2 C 3.n3 C seA/2.upper(&&
output.2rite(3-3 C header/3 C 3.n3 C seA/3.replace(3F3 33&&
4fter ma-ing these !hanges the !o%e %oesn.t pro%&!e any o&tp&t on the s!reen, so
to see 0hat.s happene% 0e. nee% to ta-e a oo- at the se;uences=fasta fie:
-!N)"23
!+)(+!)(!+)(!+)(!+)()+!(!)(+!+)(-@974$%
!)+(!+)(!)(!+)(!+)(!+)!)(!)+-1LH7GJ
!)+(!)!)+(+!)+(+!)!+(+(
$his %oesn.t oo- right 5 the se!on% an% thir% ines have been 9oine% together, as
have the fo&rth an% fifth. ,hat has happene%E
't oo-s i-e 0e.ve &n!overe% a %ifferen!e bet0een the print f&n!tion an% the
-rite metho%. print a&tomati!ay p&ts a ne0 ine at the en% of the string,
0hereas -rite %oesn.t. $his means 0e.ve got to be !aref& 0hen s0it!hing
bet0een them8 $he fi1 is <&ite simpe, 0e. 9&st a%% a ne0ine onto the en% of ea!h
string that gets 0ritten to the fie:
output B open(,seAuences.fasta, ,2,&
output.2rite(3-3 C header/" C 3.n3 C seA/" C 3.n3&
output.2rite(3-3 C header/2 C 3.n3 C seA/2.upper(& C 3.n3&
output.2rite(3-3 C header/3 C 3.n3 C seA/3.replace(3F3 33& C 3.n3&
$he arg&ments for the 0rite statements are getting <&ite !ompi!ate%, b&t they are
a ma%e &p of simpe b&i%ing bo!-s. For e1ampe the ast one, if 0e transate% it
into )ngish, 0o&% rea% >a greaterBthan symbol0 followe" by the variable hea"er?)0
followe" by a newline0 followe" by the variable se;?) with all hyphens replace" with
nothing0 followe" by another newline>.
HB Chapter 3: 2ea%ing an% 0riting fies
7ere.s the fina !o%e, in!&%ing the variabe %efinition at the beginning, 0ith ban-
ines an% !omments:
4 set the 0alues of all the header 0ariables
header/" B ,!5C36/
header_6 @ (E7819
header_/ @ :;<=>?
B set the Calues of all the seAuence Caria$les
seA_3 @ !"C#"!C#!"C#!"C#!"C#C"!#!C#"!"C#
seA_6 @ actgatcgacgatcgatcgatcacgact
seA_/ @ !C"#!C-!C"#"D!C"#"!----C!"#"#
B make a ne- file to hold the output
output @ open+seAuences.fasta% -,
B -rite the header and seAuence for seA3
output.-rite+)E) F header_3 F )\n) F seA_3 F )\n),
B -rite the header and uppercase seAuences for seA6
output.-rite+)E) F header_6 F )\n) F seA_6.upper+, F )\n),
B -rite the header and seAuence for seA/ -ith hyphens remoCed
output.-rite+)E) F header_/ F )\n) F seA_/.replace+)-)% )), F )\n),
$here.s an e1er!ise that &ses %ifferent te!hni<&es to sove a very simiar probem at
the en% of the !hapter on f&n!tiona programming in A"vance" Python for 8iologists
5 if yo& fin% yo&rsef !arrying o&t this type of pro!ess in rea ife !o%e, then it.s
probaby 0orth a oo-.
$riting multiple )53T5 files
,e !an sove this probem 0ith a sight mo%ifi!ation of o&r so&tion to the previo&s
e1er!ise. ,e. nee% to !reate three ne0 fies to ho% the o&tp&t, an% 0e. !onstr&!t
the name of ea!h fie by &sing string !on!atenation:
HC Chapter 3: 2ea%ing an% 0riting fies
output/" B open(header/" C ,.fasta, ,2,&
output/2 B open(header/2 C ,.fasta, ,2,&
output/3 B open(header/3 C ,.fasta, ,2,&
2emember, the first arg&ment to open is a string, so it.s fine to &se a !on!atenation
be!a&se 0e -no0 that the res&t of !on!atenating t0o strings is aso a string.
,e. aso !hange the -rite statements so that 0e have one for ea!h of the o&tp&t
fies. ,e nee% to be !aref& 0ith the n&mber here in or%er to ma-e s&re that 0e get
the right se<&en!e in ea!h fie. 7ere.s the fina !o%e, 0ith !omments.
4 set the 0alues of all the header 0ariables
header/" B ,!N)"23,
header/2 B ,@974$%,
header/3 B ,1LH7GJ,

4 set the 0alues of all the seAuence 0ariables
seA/" B ,!+)(+!)(!+)(!+)(!+)()+!(!)(+!+)(,
seA/2 B ,actgatcgacgatcgatcgatcacgact,
seA/3 B ,!)+(!)F!)+(+O!)+(+!FFFF)!+(+(,

4 ma:e three files to hold the output
output/" B open(header/" C ,.fasta, ,2,&
output/2 B open(header/2 C ,.fasta, ,2,&
output/3 B open(header/3 C ,.fasta, ,2,&

4 2rite one seAuence to each output file
output/".2rite(3-3 C header/" C 3.n3 C seA/" C 3.n3&
output/2.2rite(3-3 C header/2 C 3.n3 C seA/2.upper(& C 3.n3&
output/3.2rite(3-3 C header/3 C 3.n3 C seA/3.replace(3F3 33& C 3.n3&
+oo-ing at the !o%e above, it seems i-e there.s a ot of re%&n%an!y there. )a!h of
the fo&r se!tions of !o%e 5 setting the hea%er va&es, setting the se<&en!e va&es,
!reating the o&tp&t fies, an% 0riting %ata to the o&tp&t fies 5 !onsists of three
neary/i%enti!a statements. 4tho&gh the so&tion 0or-s, it seems to invove a ot
of &nne!essary typing8 4so, having so m&!h neary/i%enti!a !o%e seems i-ey to
HF Chapter 3: 2ea%ing an% 0riting fies
!a&se errors if 0e nee% to !hange something. 'n the ne1t !hapter, 0e. e1amine
some toos 0hi!h 0i ao0 &s to start removing some of that re%&n%an!y.
HH Chapter B: +ists an% oops
!: "ists and loops
Why "o we nee" lists an" loops?
$hin- ba!- over the e1er!ises that 0e.ve seen in the previo&s t0o !hapters 5 they.ve
a invove% %eaing 0ith one bit of information at a time. 'n !hapter 2, 0e &se%
string manip&ation toos to pro!ess singe se<&en!es, an% in !hapter 3, 0e
pra!tise% rea%ing an% 0riting fies one at a time. $he !osest 0e got to &sing
m&tipe pie!es of %ata 0as %&ring the fina e1er!ise in !hapter 3, 0here 0e 0ere
%eaing 0ith three D;4 se<&en!es.
'f that.s a that #ython ao0e% &s to %o, it 0o&%n.t be a very hepf& too for
bioogy. 'n fa!t, there.s a goo% !han!e that yo&.re rea%ing this boo- be!a&se yo&
0ant to be abe to 0rite programs to hep yo& %ea 0ith arge %atasets. 4 very
!ommon sit&ation in bioogi!a resear!h is to have a arge !oe!tion of %ata :D;4
se<&en!es, ";# positions, gene e1pression meas&rements= that a nee% to be
pro!esse% in the same 0ay. 'n this !hapter, 0e. earn abo&t the f&n%amenta
programming toos that 0i ao0 o&r programs to %o this.
"o far 0e have earne% abo&t severa %ifferent %ata types :strings, n&mbers, an% fie
ob9e!ts=, a of 0hi!h store a singe bit of information
1
. ,hen 0e.ve nee%e% to store
m&tipe bits of information :for e1ampe, the three D;4 se<&en!es in the !hapter
3 e1er!ises= 0e have simpy !reate% more variabes to ho% them:
4 set the 0alues of all the seAuence 0ariables
seA/" B ,!+)(+!)(!+)(!+)(!+)()+!(!)(+!+)(,
seA/2 B ,actgatcgacgatcgatcgatcacgact,
seA/3 B ,!)+(!)F!)+(+O!)+(+!FFFF)!+(+(,
1 ,e -no0 that fies are sighty %ifferent to strings an% n&mbers be!a&se they !an store a ot of information,
b&t ea!h fie ob9e!t sti ony refers to a singe fie.
HI Chapter B: +ists an% oops
$he imitations of this approa!h be!ame !ear <&ite <&i!-y as 0e oo-e% at the
so&tion !o%e 5 it ony 0or-e% be!a&se the n&mber of se<&en!es 0ere sma, an% 0e
-ne0 the n&mber in a%van!e. 'f 0e 0ere to repeat the e1er!ise 0ith three h&n%re%
or three tho&san% se<&en!es, the vast ma9ority of the !o%e 0o&% be given over to
storing variabes an% it 0o&% be!ome !ompetey &nmanageabe. 4n% if 0e 0ere
to try an% 0rite a program that !o&% pro!ess an &n-no0n n&mber of inp&t
se<&en!es :for instan!e, by rea%ing them from a fie=, 0e 0o&%n.t be abe to %o it.
$o ma-e o&r programs abe to pro!ess m&tipe pie!es of %ata, 0e nee% an entirey
ne0 type of str&!t&re 0hi!h !an ho% many pie!es of information at the same time
5 a list.
,e.ve aso %eat e1!&sivey 0ith programs 0hose statements are e1e!&te% from top
to bottom in a very straightfor0ar% 0ay. $his has great a%vantages 0hen first
starting to thin- abo&t programming 5 it ma-es it very easy to foo0 the fo0 of a
program. $he %o0nsi%e of this se<&entia stye of programming, ho0ever, is that it
ea%s to very re%&n%ant !o%e i-e 0e sa0 at the en% of the previo&s !hapter:
4 ma:e three files to hold the output
output/" B open(header/" C ,.fasta, ,2,&
output/2 B open(header/2 C ,.fasta, ,2,&
output/3 B open(header/3 C ,.fasta, ,2,&
4gainD it 0as ony possibe to sove the e1er!ise in this manner be!a&se 0e -ne0 in
a%van!e the n&mber of o&tp&t fies 0e 0ere going to nee%. +oo-ing at the !o%e, it.s
!ear that these three ines !onsist of essentiay the same statement being
e1e!&te% m&tipe times, 0ith some sight variations. $his i%ea of repetition/0ith/
variation is in!re%iby !ommon in programming probems, an% #ython has b&it in
toos for e1pressing it 5 loops.
H@ Chapter B: +ists an% oops
4reating lists an" retrieving elements
$o ma-e a ne0 ist, 0e p&t severa strings or n&mbers
1
insi%e s<&are bra!-ets,
separate% by !ommas:
apes B ',1omo sapiens, ,Pan troglodytes, ,(orilla gorilla,*
conser0ed/sites B '24 $% "32*
)a!h in%ivi%&a item in a ist is !ae% an element. $o get a singe eement from the
ist, 0rite the variabe name foo0e% by the in"e! of the eement yo& 0ant in
s<&are bra!-ets:
apes B ',1omo sapiens, ,Pan troglodytes, ,(orilla gorilla,*
conser0ed/sites B '24 $% "32*
print(apes'#*&
first/site B conser0ed/sites'2*
'f 0e 0ant to go in the other %ire!tion 5 i.e. 0e -no0 0hi!h eement 0e 0ant b&t
0e %on.t -no0 the in%e1 5 0e !an &se the index metho%:
apes B ',1omo sapiens, ,Pan troglodytes, ,(orilla gorilla,*
chimp/index B apes.index(,Pan troglodytes,&
4 chimp/index is no2 "
2emember that in #ython 0e start !o&nting from Gero rather than one, so the first
eement of a ist is a0ays at in%e1 Gero. 'f 0e give a negative n&mber, #ython starts
!o&nting from the end of the ist rather than the beginning 5 so it.s easy to get the
ast eement from a ist:
last/ape B apes'F"*
1 ?r in fa!t, any other type of va&e or variabe
I0 Chapter B: +ists an% oops
,hat if 0e 0ant to get more than one eement from a istE ,e !an give a start an%
stop position, separate% by a !oon, to spe!ify a range of eements:
ran:s B ',:ingdom,,phylum, ,class, ,order, ,family,*
lo2er/ran:s B ran:s'2:$*
4 lo2er ran:s are class order and family
Does this oo- famiiarE 't.s the e1a!t same notation that 0e &se% to get s&bstrings
ba!- in !hapter 2, an% it 0or-s in e1a!ty the same 0ay 5 n&mbers are inclusive at
the start an% exclusive at the en%. $he fa!t that 0e &se the same notation for
strings an% ists hints at a %eeper reationship bet0een the t0o types. 'n fa!t, 0hat
0e 0ere %oing 0hen e1tra!ting s&bstrings in !hapter 2 0as treating a string as
t,oug, it were a list of c,aracters. $his i%ea 5 that 0e !an treat a variabe as
tho&gh it 0ere a ist 0hen it.s not 5 is a po0erf& one in #ython an% 0e. !ome ba!-
to it ater in this !hapter :an% aso in the !hapter on iterators in A"vance" Python
for 8iologists=.
Working with list elements
$o a%% another eement onto the en% of an e1isting ist, 0e !an &se the append
metho%:
apes B ',1omo sapiens, ,Pan troglodytes, ,(orilla gorilla,*
apes.append(,Pan paniscus,&
append is an interesting metho% be!a&se it a!t&ay !hanges the variabe on 0hi!h
it.s &se% 5 in the above e1ampe, the apes ist goes from having three eements to
having fo&r. ,e !an get the ength of a ist by &sing the len f&n!tion, 9&st i-e 0e
%i% for strings:
I1 Chapter B: +ists an% oops
apes B ',1omo sapiens, ,Pan troglodytes, ,(orilla gorilla,*
print(,+here are , C str(len(apes&& C , apes,&
apes.append(,Pan paniscus,&
print(,<o2 there are , C str(len(apes&& C , apes,&
$he o&tp&t sho0s that the n&mber of eements in apes reay has !hange%:
+here are 3 apes
<o2 there are 4 apes
,e !an !on!atenate t0o ists 9&st as 0e %i% 0ith strings, by &sing the p&s symbo:
apes B ',1omo sapiens, ,Pan troglodytes, ,(orilla gorilla,*
mon:eys B ',Papio ursinus, ,Pacaca mulatta,*
primates B apes C mon:eys
print(str(len(apes&& C , apes,&
print(str(len(mon:eys&& C , mon:eys,&
print(str(len(primates&& C , primates,&
4s 0e !an see from the o&tp&t, this %oesn.t !hange either of the t0o origina ists 5
it ma-es a bran% ne0 ist 0hi!h !ontains eements from both:
3 apes
2 mon:eys
$ primates
'f 0e 0ant to a%% eements from a ist onto the en% of an e1isting ist, !hanging it
in the pro!ess, 0e !an &se the extend metho%. extend behaves i-e append b&t
ta-es a list as its arg&ment rather than a singe element.
7ere are t0o more ist metho%s that !hange the variabe they.re &se% on: reCerse
an% sort. (oth reCerse an% sort 0or- by !hanging the or%er of the eements in
I2 Chapter B: +ists an% oops
the ist. 'f 0e 0ant to print o&t a ist to see ho0 this 0or-s, 0e nee% to &se% str
:9&st as 0e %i% 0hen printing o&t n&mbers=:
ran:s B ',:ingdom,,phylum, ,class, ,order, ,family,*
print(,at the start : , C str(ran:s&&
ran:s.re0erse(&
print(,after re0ersing : , C str(ran:s&&
ran:s.sort(&
print(,after sorting : , C str(ran:s&&
'f 0e ta-e a oo- at the o&tp&t, 0e !an see ho0 the or%er of the eements in the ist
is !hange% by these t0o metho%s:
at the start : '3:ingdom3 3phylum3 3class3 3order3 3family3*
after re0ersing : '3family3 3order3 3class3 3phylum3 3:ingdom3*
after sorting : '3class3 3family3 3:ingdom3 3order3 3phylum3*
(y %efa&t, #ython sorts strings in aphabeti!a or%er an% n&mbers in as!en%ing
n&meri!a or%er
1
.
Writing a loop
'magine 0e 0ante% to ta-e o&r ist of apes:
apes B ',1omo sapiens, ,Pan troglodytes, ,(orilla gorilla,*
an% print o&t ea!h eement on a separate ine, i-e this:
1omo sapiens is an ape
Pan troglodytes is an ape
(orilla gorilla is an ape
1 ,e !an sort in other 0ays too, b&t that.s beyon% the s!ope of this boo-
I3 Chapter B: +ists an% oops
?ne 0ay to %o it 0o&% be to 9&st print ea!h eement separatey:
print(apes'#* C , is an ape,&
print(apes'"* C , is an ape,&
print(apes'2* C , is an ape,&
b&t this is very repetitive an% reies on &s -no0ing the n&mber of eements in the
ist. ,hat 0e nee% is a 0ay to say something aong the ines of >for each element in
the list of apes0 print out the element0 followe" by the wor"s 7 is an ape7>. #ython.s oop
synta1 ao0s &s to e1press those instr&!tions i-e this:
for ape in apes:
print(ape C , is an ape,&
+et.s ta-e a moment to oo- at the %ifferent parts of this oop. ,e start by 0riting
for x in y, 0here y is the name of the ist 0e 0ant to pro!ess an% x is the name
0e 0ant to &se for the !&rrent eement ea!h time ro&n% the oop.
x is 9&st a variabe name :so it foo0s a the r&es that 0e.ve area%y earne% abo&t
variabe names=, b&t it behaves sighty %ifferenty to a the other variabes 0e.ve
seen so far. 'n a previo&s e1ampes, 0e !reate a variabe an% store something in it,
an% then the va&e of that variabe %oesn.t !hange &ness 0e !hange it o&rseves. 'n
!ontrast, 0hen 0e !reate a variabe to be &se% in a oop, 0e %on.t set its va&e 5 the
va&e of the variabe 0i be a&tomati!ay set to ea!h eement of the ist in t&rn,
an% it 0i be %ifferent ea!h time ro&n% the oop.
'mportanty, the oop variabe x ony e1ists insi%e the oop 5 it gets !reate% at the
start of ea!h oop iteration, an% %isappears at the en%. $his means that on!e the
oop has finishe% r&nning for the ast time, that variabe is gone forever. ,hen a
variabe is restri!te% to a bo!- of !o%e i-e this, 0e !a it the variabe.s scope 5 0e
0i see severa more e1ampes ater in the boo-.
IB Chapter B: +ists an% oops
$his first ine of the oop en%s 0ith a !oon, an% a the s&bse<&ent ines :9&st one,
in this !ase= are in%ente%. 'n%ente% ines !an start 0ith any n&mber of tab or spa!e
!hara!ters, b&t they m&st a be in%ente% in the same 0ay. $his pattern 5 a ine
0hi!h en%s 0ith a !oon, foo0e% by some in%ente% ines 5 is very !ommon in
#ython, an% 0e. see it in severa more pa!es thro&gho&t this boo-. 4 gro&p of
in%ente% ines is often !ae% a block of !o%e
1
.
'n this !ase, 0e refer to the in%ente% bo!- as the bo"y of the oop, an% the ines
insi%e it 0i be e1e!&te% on!e for ea!h eement in the ist. $o refer to the !&rrent
eement, 0e &se the variabe name that 0e 0rote in the first ine. $he bo%y of the
oop !an !ontain as many ines as 0e i-e, an% !an in!&%e a the f&n!tions an%
metho%s that 0e.ve earne% abo&t, 0ith one important e1!eption: 0e.re not ao0e%
to !hange the ist 0hie insi%e the bo%y of the oop
2
.
7ere.s an e1ampe of a oop 0ith a more !ompi!ate% bo%y:
apes B ',1omo sapiens, ,Pan troglodytes, ,(orilla gorilla,*
for ape in apes:
name/length B len(ape&
first/letter B ape'#*
print(ape C , is an ape. Lts name starts 2ith , C first/letter&
print(,Lts name has , C str(name/length& C , letters,&
$he bo%y of the oop in the !o%e above has fo&r statements, t0o of 0hi!h are
print statements, so ea!h time ro&n% the oop 0e. get t0o ines of o&tp&t. 'f 0e
oo- at the o&tp&t 0e !an see a si1 ines:
1 'f yo&.re famiiar 0ith any other programming ang&ages, yo& might -no0 !o%e bo!-s as things that are
s&rro&n%e% 0ith !&ry bra!-ets 5 the in%entation %oes the same 9ob in #ython
2 Changing the ist 0hie ooping !an !a&se #ython to be!ome !onf&se% abo&t 0hi!h eements have area%y
been pro!esse% an% 0hi!h are yet to !ome.
IC Chapter B: +ists an% oops
1omo sapiens is an ape. Lts name starts 2ith 1
Lts name has "2 letters
Pan troglodytes is an ape. Lts name starts 2ith P
Lts name has "$ letters
(orilla gorilla is an ape. Lts name starts 2ith (
Lts name has "$ letters
,hy is the above approa!h better than printing o&t these si1 ines in si1 separate
statementsE ,e, for one thing, there.s m&!h ess re%&n%an!y 5 here 0e ony
nee%e% to 0rite t0o print statements. $his aso means that if 0e nee% to ma-e a
!hange to the !o%e, 0e ony have to ma-e it on!e rather than three separate times.
4nother benefit of &sing a oop here is that if 0e 0ant to a%% some eements to the
ist, 0e %on.t have to to&!h the oop !o%e at a. Conse<&enty, it %oesn.t matter
ho0 many eements are in the ist, an% it.s not a probem if 0e %on.t -no0 ho0
many are going to be in it at the time 0hen 0e 0rite the !o%e.
Many probems that !an be sove% 0ith oops !an aso be sove% &sing a too !ae%
ist !omprehensions 5 see the !hapter on !omprehensions in A"vance" Python for
8iologists.
5n"entation errors
*nfort&natey, intro%&!ing toos i-e oops that re<&ire an in%ente% bo!- of !o%e
aso intro%&!es the possibiity of a ne0 type of error 5 an ;ndentationError.
;oti!e 0hat happens 0hen the in%entation of one of the ines in the bo!- %oes not
mat!h the others:
apes B ',1omo sapiens, ,Pan troglodytes, ,(orilla gorilla,*
for ape in apes:
name/length B len(ape&
first/letter B ape'#*
print(ape C , is an ape. Lts name starts 2ith , C first/letter&
print(,Lts name has , C str(name/length& C , letters,&
IF Chapter B: +ists an% oops
,hen 0e r&n this !o%e, 0e get an error message before the program even starts to
r&n:
Lndentation9rror: unindent does not match any outer indentation le0el
,hen yo& en!o&nter an ;ndentationError, go ba!- to yo&r !o%e an% %o&be/
!he!- that a the ines in the bo!- mat!h &p. 4so %o&be/!he!- that yo& are &sing
either tabs or spa!es for in%entation, not bot,. $he easiest 0ay to %o this, as
mentione% in !hapter 1, is to enabe tab emulation in yo&r te1t e%itor.
,sing a string as a list
,e.ve area%y seen ho0 a string !an preten% to be a ist 5 0e !an &se ist in%e1
notation to get in%ivi%&a !hara!ters or s&bstrings from insi%e a string. Can 0e aso
&se oop notation to pro!ess a string as tho&gh it 0ere a istE 3es 5 if 0e 0rite a
oop statement 0ith a string in the position 0here 0e.% normay fin% a ist, #ython
treats eac, c,aracter in the string as a separate eement. $his ao0s &s to very
easiy pro!ess a string one !hara!ter at a time:
name B ,martin,
for character in name:
print(,one character is , C character&
'n this !ase, 0e.re 9&st printing ea!h in%ivi%&a !hara!ter:
one character is m
one character is a
one character is r
one character is t
one character is i
one character is n
IH Chapter B: +ists an% oops
$he pro!ess of repeating a set of instr&!tions for ea!h eement of a ist :or
!hara!ter in a string= is !ae% iteration, an% 0e often ta- abo&t iterating over a ist
or string.
&plitting a string to make a list
"o far in this !hapter, a o&r ists have been 0ritten man&ay. 7o0ever, there are
penty of f&n!tions an% metho%s in #ython that pro%&!e ists as their o&tp&t. ?ne
s&!h metho% that is parti!&ary interesting to bioogists is the split metho%
0hi!h 0or-s on strings. split ta-es a singe arg&ment, !ae% the "elimiter, an%
spits the origina string 0herever it sees the %eimiter, pro%&!ing a ist. 7ere.s an
e1ampe:
names B ,melanogastersimulansya:ubaananassae,
species B names.split(,,&
print(str(species&&
,e !an see from the o&tp&t that the string has been spit 0herever there 0as a
!omma eaving &s 0ith a ist of strings:
'3melanogaster3 3simulans3 3ya:uba3 3ananassae3*
?f !o&rse, on!e 0e.ve !reate% a ist in this 0ay 0e !an iterate over it &sing a oop,
9&st i-e any other ist.
5terating over lines in a file
4nother very &sef& thing that 0e !an iterate over is a fie. J&st as a string !an
preten% to be a ist for the p&rposes of ooping, a fie ob9e!t !an %o the same tri!-
1
.
,hen 0e treat a string as a ist, ea!h !hara!ter be!omes an in%ivi%&a eement, b&t
1 'f yo&.re intereste% in ho0 this >preten%ing> a!t&ay 0or-s, oo- &p the #ython %o!&mentation for iterators
5 b&t be prepare% to %o <&ite a bit of rea%ing8
II Chapter B: +ists an% oops
0hen 0e treat a fie as a ist, ea!h line be!omes an in%ivi%&a eement. $his ma-es
pro!essing a fie ine/by/ine very easy:
file B open(,some/input.txt,&
for line in file:
4 do something 2ith the line
4 <&i!- 0arning: 0hen yo&.re 0riting a program that rea%s %ata from a fie, it.s best
to &se eit,er the read metho% :to store the entire !ontents in a variabe= or the
oop metho% :to %ea 0ith ea!h ine separatey=. 'f yo& try to mi1 them, yo& might
get &ne1pe!te% behavio&r. $he reason for this is that #ython -eeps tra!- of its
position in ea!h fie, so if yo& rea% the !ontents of a fie ob9e!t &sing the read
metho%, an% then ater try to pro!ess it one ine at a time 0ith a oop, yo& 0on.t get
any inp&t be!a&se #ython thin-s it.s area%y at the en% of the fie. 'f yo& abso&tey
have to &se one metho% an% then the other, yo& !an get ro&n% this probem by
!osing an% then re/opening the fie.
6ooping with ranges
"ometimes 0e 0ant to oop over a ist of n&mbers. 'magine 0e have a protein
se<&en!e:
protein B ,0lspad:tn0,
an% 0e 0ant to print o&t the first three resi%&es, then the first fo&r resi%&es, et!.
et!.:
0ls
0lsp
0lspa
0lspad
...etc...
I@ Chapter B: +ists an% oops
?ne 0ay to ta!-e the probem 0o&% be to &se a oop 5 0e !o&% e1tra!t a s&bstring
from the protein se<&en!e an% print it in the bo%y of the oop, an% the ony thing
that 0o&% nee% to !hange is the stop position in the s&bstring. (&t 0hat are 0e
going to iterate overE ,e !an.t 9&st iterate over the protein string, be!a&se that 0i
give &s in%ivi%&a resi%&es, 0hi!h is not 0hat 0e 0ant. ,e !an man&ay assembe a
ist of stop positions, an% oop over that:
stop/positions B '34$%7GJ"#*
for stop in stop/positions:
substring B protein'#:stop*
print(substring&
b&t this seems !&mbersome, an% ony 0or-s if 0e -no0 the ength of the protein
se<&en!e in a%van!e.
4 better so&tion is to &se the range f&n!tion. range is a b&it/in #ython f&n!tion
that generates ists of n&mbers for &s to oop over. $he behavio&r of the range
f&n!tion %epen%s on ho0 many arg&ments 0e give it. (eo0 are a fe0 e1ampes,
0ith the o&tp&t foo0ing %ire!ty after the !o%e.
,ith a singe arg&ment, range 0i !o&nt &p from Gero to that n&mber, e1!&%ing
the n&mber itsef:
for number in range(%&:
print(number&
#
"
2
3
4
$
@0 Chapter B: +ists an% oops
,ith t0o n&mbers, range 0i !o&nt &p from the first n&mber :in!&sive
1
= to the
se!on% :e1!&sive=:
for number in range(3 G&:
print(number&
3
4
$
%
7
,ith three n&mbers, range 0i !o&nt &p from the first to the se!on% 0ith the step
siGe given by the thir%:
for number in range(2 "4 4&:
print(number&
2
%
"#
(ecap
'n this !hapter 0e.ve seen severa toos that 0or- together to ao0 o&r programs to
%ea eeganty 0ith m&tipe pie!es of %ata. +ists et &s store many eements in a
singe variabe, an% oops et &s pro!ess those eements one by one. 'n earning
abo&t oops, 0e.ve aso been intro%&!e% to the bo!- synta1 an% the importan!e of
in%entation in #ython.
1 $he r&es for ranges are the same as for array notation 5 in!&sive on the o0 en%, e1!&sive on the high en%
5 so yo& ony have to memoriGe them on!e8
@1 Chapter B: +ists an% oops
,e.ve aso seen severa &sef& 0ays in 0hi!h 0e !an &se the notation 0e.ve earne%
for 0or-ing 0ith ists 0ith other types of %ata. Depen%ing on the !ir!&mstan!es, 0e
!an &se strings, files, an% ranges as if they 0ere ists. $his is a very hepf& feat&re of
#ython, be!a&se on!e 0e.ve be!ome famiiar 0ith the synta1 for 0or-ing 0ith ists,
0e !an &se it in many %ifferent pa!e. +earning abo&t these toos has aso hepe% &s
ma-e sense of some interesting behavio&r that 0e observe% in earier !hapters.
+ists are the first e1ampe 0e.ve en!o&ntere% of str&!t&res that !an ho% m&tipe
pie!es of %ata. ,e. en!o&nter another s&!h str&!t&re 5 the %i!t 5 in !hapter I. 'n
fa!t, #ython has severa more s&!h %ata types 5 yo&. fin% a f& s&rvey of them in
the !hapter on !ompe1 %ata str&!t&res in A"vance" Python for 8iologists.
@2 Chapter B: +ists an% oops
!ercises
/ote: a the fies mentione% in these e1er!ises !an be fo&n% in the chapter?* fo%er
of the e1er!ises %o0noa%.
Processing (/5 in a file
$he fie input=t!t !ontains a n&mber of D;4 se<&en!es, one per ine. )a!h se<&en!e
starts 0ith the same 1B base pair fragment 5 a se<&en!ing a%apter that sho&% have
been remove%. ,rite a program that 0i :a= trim this a%apter an% 0rite the !eane%
se<&en!es to a ne0 fie an% :b= print the ength of ea!h se<&en!e to the s!reen.
9ultiple exons from genomic (/5
$he fie genomic?"na=t!t !ontains a se!tion of genomi! D;4, an% the fie e!ons=t!t
!ontains a ist of start/stop positions of e1ons. )a!h e1on is on a separate ine an%
the start an% stop positions are separate% by a !omma. ,rite a program that 0i
e1tra!t the e1on segments, !on!atenate them, an% 0rite them to a ne0 fie.
@3 Chapter B: +ists an% oops
&olutions
Processing (/5 in a file
$his seems a bit more !ompi!ate% than previo&s e1er!ises 5 0e are being as-e% to
0rite a program that %oes t0o things at on!e8 5 so ets ta!-e it one step at a time.
First, 0e. 0rite a program that simpy rea%s ea!h se<&en!e from the fie an% prints
it to the s!reen:
file B open(,input.txt,&
for dna in file:
print(dna&
,e !an see from the o&tp&t that 0e.ve forgotten to remove the ne0ines from the
en%s of the D;4 se<&en!es 5 there is a ban- ine bet0een ea!h:
!++)(!++!+!!()+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)

!++)(!++!+!!()!)+(!+)(!+)(!+)(!+)(!+)(!+()+!+)(+)(+
!++)(!++!+!!()!+)(!+)!)(!+)+!+)(+!)(+!+()!+!+)(!+!+)(!+)(+!(+)
!++)(!++!+!!()!)+!+)(!+(!+)+!()+!)(!+)(+!()+(+!
!++)(!++!+!!()!)+!()+!(+)+)(!+()!+(!+)!()++!()+(!+(!+()+!+()!
b&t 0e. ignore that for no0. $he ne1t step is to remove the first 1B bases of ea!h
se<&en!e. ,e -no0 that 0e 0ant to ta-e a s&bstring from ea!h se<&en!e, starting
at the fifteenth !hara!ter, an% !ontin&ing to the en%. *nfort&natey, the se<&en!es
are a %ifferent engths, so the stop position is going to be %ifferent for a of them.
,e. have to !a!&ate the position of the ast !hara!ter for ea!h se<&en!e, by &sing
the len f&n!tion to !a!&ate the ength.
@B Chapter B: +ists an% oops
7ere.s 0hat the !o%e oo-s i-e 0ith the s&bstring part a%%e%:
file B open(,input.txt,&
for dna in file:
last/character/position B len(dna&
trimmed/dna B dna'"4:last/character/position*
print(trimmed/dna&
4s before, 0e are simpy printing the trimme% D;4 se<&en!e to the s!reen, an%
from the o&tp&t 0e !an !onfirm that the first 1B bases have been remove% from
ea!h se<&en!e:
+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)
!)+(!+)(!+)(!+)(!+)(!+)(!+()+!+)(+)(+
!+)(!+)!)(!+)+!+)(+!)(+!+()!+!+)(!+!+)(!+)(+!(+)
!)+!+)(!+(!+)+!()+!)(!+)(+!()+(+!
!)+!()+!(+)+)(!+()!+(!+)!()++!()+(!+(!+()+!+()!
;o0 that 0e -no0 o&r !o%e is 0or-ing, 0e. s0it!h from printing to the s!reen to
0riting to a fie. ,e. have to open the fie before the oop, then 0rite the trimme%
se<&en!es to the fie inside the oop:
file B open(,input.txt,&
output B open(,trimmed.txt, ,2,&
for dna in file:
last/character/position B len(dna&
trimmed/dna B dna'"4:last/character/position*
output.2rite(trimmed/dna&
@C Chapter B: +ists an% oops
?pening &p the trimme"=t!t fie, 0e !an see that the res&t oo-s goo%. 't %i%n.t
matter that 0e never remove% the ne0ines, be!a&se they appear in the !orre!t
pa!e in the o&tp&t fie any0ay:
+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)(!+)
!)+(!+)(!+)(!+)(!+)(!+)(!+()+!+)(+)(+
!+)(!+)!)(!+)+!+)(+!)(+!+()!+!+)(!+!+)(!+)(+!(+)
!)+!+)(!+(!+)+!()+!)(!+)(+!()+(+!
!)+!()+!(+)+)(!+()!+(!+)!()++!()+(!+(!+()+!+()!
;o0 the fina step 5 printing the engths to the s!reen 5 re<&ires 9&st one more ine
of !o%e. 7ere.s the fina program in f&, 0ith !omments:
4 open the input file
file B open(,input.txt,&

4 open the output file
output B open(,trimmed.txt, ,2,&

4 go through the input file one line at a time
for dna in file:

4 calculate the position of the last character
last/character/position B len(dna&

4 get the substring from the "$th character to the end
trimmed/dna B dna'"4:last/character/position*

4 print out the trimmed seAuence
output.2rite(trimmed/dna&

4 print out the length to the screen
print(,processed seAuence 2ith length , C str(len(trimmed/dna&&&
@F Chapter B: +ists an% oops
9ultiple exons from genomic (/5
$his is very simiar to the e1er!ises from the previo&s t0o !hapters, an% so o&r
so&tion to it is going to oo- very simiar. +et.s !on!entrate on the ne0 bit of the
probem first 5 rea%ing the fie of e1on o!ations. 4s before, 0e !an start by opening
&p the fie an% printing ea!h ine to the s!reen:
exon/locations B open(,exons.txt,&
for line in exon/locations:
print(line&
$his gives &s a oop in 0hi!h 0e are %eaing 0ith a %ifferent e1on ea!h time ro&n%.
'f 0e oo- at the o&tp&t, 0e !an see that 0e sti have a ne0ine at the en% of ea!h
ine, b&t 0e. not 0orry abo&t that for no0:
$$G

72"33

"J#27%

34#3JG
;o0 0e have to spit &p ea!h ine into a start an% stop position. $he split metho%
is probaby a goo% !hoi!e for this 9ob 5 et.s see 0hat happens 0hen 0e spit ea!h
ine &sing a !omma as the %eimiter:
exon/locations B open(,exons.txt,&
for line in exon/locations:
positions B line.split(33&
print(positions&
$he o&tp&t sho0s that ea!h ine, 0hen spit, t&rns into a ist of t0o eements:
@H Chapter B: +ists an% oops
'3$3 3$G.n3*
'3723 3"33.n3*
'3"J#3 327%.n3*
'334#3 33JG.n3*
$he se!on% eement of ea!h ist has a ne0ine on the en%, be!a&se 0e haven.t
remove% them. +et.s try assigning the start an% stop position to sensibe variabe
names, an% printing them o&t in%ivi%&ay:
exon/locations B open(,exons.txt,&
for line in exon/locations:
positions B line.split(33&
start B positions'#*
stop B positions'"*
print(,start is , C start C , stop is , C stop&
$he o&tp&t sho0s that this approa!h 0or-s 5 the start an% stop variabes ta-e
%ifferent va&es ea!h time ro&n% the oop:
start is $ stop is $G

start is 72 stop is "33

start is "J# stop is 27%

start is 34# stop is 3JG

;o0 et.s try p&tting these variabes to &se. ,e. rea% the genomi! se<&en!e from
the fie a in one go &sing read 5 there.s no nee% to pro!ess ea!h ine separatey,
as 0e 9&st 0ant the entire !ontents. $hen 0e. &se the e1on !oor%inates to e1tra!t
one e1on ea!h time ro&n% the oop, an% print it to the s!reen:
@I Chapter B: +ists an% oops
genomic/dna B open(,genomic/dna.txt,&.read(&
exon/locations B open(,exons.txt,&
for line in exon/locations:
positions B line.split(33&
start B positions'#*
stop B positions'"*
exon B genomic/dna'start:stop*
print(,exon is: , C exon&
*nfort&natey, 0hen 0e r&n this !o%e 0e get an error at ine H:
7ile ,multiple/exons/from/genomic/dna.py, line 7 in ;module-
exon B genomic/dna'start:stop*
+ype9rror: slice indices must be integers or <one or ha0e an //index//
method
,hat has gone 0rongE 2e!a that the res&t of &sing split on a string is a ist of
strings 5 this means that the start an% stop Caria$les in o&r program are aso
strings :be!a&se they.re 9&st in%ivi%&a eements of the positions ist=. $he
probem !omes 0hen 0e try to &se them as n&mbers in ine H. Fort&natey, it.s
easiy fi1e% 5 0e 9&st have to &se the int f&n!tion to t&rn o&r strings into
n&mbers:
start B int(positions'#*&
stop B int(positions'"*&
an% the program 0or-s as inten%e%.
;e1t step: %oing something &sef& 0ith the e1ons, rather than 9&st printing them to
the s!reen. $he e1er!ise %es!ription says that 0e have to !on!atenate the e1on
se<&en!es to ma-e a ong !o%ing se<&en!e. 'f 0e ha% a the e1ons in separate
variabes, then this 0o&% be easyD
coding/seA B exon" C exon2 C exon3 C exon4
1
2
3
4
5
6
7
8
@@ Chapter B: +ists an% oops
b&t instea% 0e have a singe exon variabe that stores one e1on at a time. 7ere.s
one 0ay to get the !ompete !o%ing se<&en!e: before the oop starts 0e. !reate a
ne0 variabe !ae% coding_seAuence an% assign it to an empty string. $hen,
ea!h time ro&n% the oop, 0e. a%% the !&rrent e1on on to the en%, an% store the
res&t ba!- in the same variabe. ,hen the oop has finishe%, the variabe 0i
!ontain a the e1ons. $his is 0hat the !o%e oo-s i-e :0ith ine n&mbers as the
program is getting <&ite ong=:
genomic/dna B open(,genomic/dna.txt,&.read(&
exon/locations B open(,exons.txt,&
coding/seAuence B ,,
for line in exon/locations:
positions B line.split(33&
start B int(positions'#*&
stop B int(positions'"*&
exon B genomic/dna'start:stop*
coding/seAuence B coding/seAuence C exon
print(,coding seAuence is : , C coding/seAuence&
?n ine 3 0e !reate the coding_seAuence variabe, an% on ine @, insi%e the oop,
0e a%% the !&rrent exon on to the en%. $his is an &n&s&a type of variabe
assignment, be!a&se the coding_seAuence variabe is on both the eft an% right
si%e of the e<&as sign. $he tri!- to &n%erstan%ing ine @ is to rea% the right/han%
si%e of the statement first i.e. >concatenate the current coding_seAuence an" the
current exon0 then store the result of that concatenation in coding_seAuence>.
?n ine 10, instea% of printing the e1on, 0e.re printing the !o%ing se<&en!e, an% 0e
!an see from the o&tp&t ho0 the !o%ing se<&en!e is gra%&ay b&it &p as 0e go
ro&n% the oop:
1
2
3
4
5
6
7
8
9
10
100 Chapter B: +ists an% oops
coding seAuence is : )(+!))(+)(!)(!+()+!)(!+)(+)(!+)(+!(+)(!+)!+)(!+)(!+)(
coding seAuence is :
)(+!))(+)(!)(!+()+!)(!+)(+)(!+)(+!(+)(!+)!+)(!+)(!+)()(!+)(!+)(!+!+)(!+)(!
+!+)!+)(!+()!+)(!+)!+)(!+)(!+)(!+)(!+)(!
coding seAuence is :
)(+!))(+)(!)(!+()+!)(!+)(+)(!+)(+!(+)(!+)!+)(!+)(!+)()(!+)(!+)(!+!+)(!+)(!
+!+)!+)(!+()!+)(!+)!+)(!+)(!+)(!+)(!+)(!)(!+)(!+)(!+)(+!()+!()+!()+!(!+)(!
+)!+)!+)(+!()+!()+)(!)+!()+!)(+!)(!+)(!+()!+)(!+)(+!
coding seAuence is :
)(+!))(+)(!)(!+()+!)(!+)(+)(!+)(+!(+)(!+)!+)(!+)(!+)()(!+)(!+)(!+!+)(!+)(!
+!+)!+)(!+()!+)(!+)!+)(!+)(!+)(!+)(!+)(!)(!+)(!+)(!+)(+!()+!()+!()+!(!+)(!
+)!+)!+)(+!()+!()+)(!)+!()+!)(+!)(!+)(!+()!+)(!+)(+!)(!+)(!+)(!+)(!+)(!+)(
!+)(!+)(!+)(!+)(!+)(+!()+!()+!)(!+)(
$he fina step is to save the !o%ing se<&en!e to a fie. ,e !an %o this at the en% of
the program 0ith three ines of !o%e. 7ere.s the fina !o%e 0ith !omments:
101 Chapter B: +ists an% oops
4 open the genomic dna file and read the contents
genomic/dna B open(,genomic/dna.txt,&.read(&

4 open the exons locations file
exon/locations B open(,exons.txt,&

4 create a 0ariable to hold the coding seAuence
coding/seAuence B ,,

4 go through each line in the exon locations file
for line in exon/locations:

4 split the line using a comma
positions B line.split(33&

4 get the start and stop positions
start B int(positions'#*&
stop B int(positions'"*&

4 extract the exon from the genomic dna
exon B genomic/dna'start:stop*

4 append the exon to the end of the current coding seAuence
coding/seAuence B coding/seAuence C exon

4 2rite the coding seAuence to an output file
output B open(,coding/seAuence.txt, ,2,&
output.2rite(coding/seAuence&
output.close(&
102 Chapter C: ,riting o&r o0n f&n!tions
: $riting our own functions
Why "o we want to write our own functions?
$a-e a oo- ba!- at the very first e1er!ise in this boo- 5 the one in !hapter 2 0here
0e ha% to 0rite a program to !a!&ate the 4$ !ontent of a D;4 se<&en!e. +et.s
remin% o&rseves of the !o%e:
my/dna B ,!)+(!+)(!++!)(+!+!(+!+++()+!+)!+!)!+!+!+!+)(!+()(++)!+,
length B len(my/dna&
a/count B my/dna.count(3!3&
t/count B my/dna.count(3+3&
at/content B (a/count C t/count& / length
print(,!+ content is , C str(at/content&&
'f 0e %is!o&nt the first ine :0hose 9ob is to store the inp&t se<&en!e= an% the ast
ine :0hose 9ob is to print the res&t=, 0e !an see that it ta-es fo&r ines of !o%e to
!a!&ate the 4$ !ontent
1
. $his means that every pa!e in o&r !o%e 0here 0e 0ant
to !a!&ate the 4$ !ontent of a se<&en!e, 0e nee% these same fo&r ines 5 an% 0e
have to ma-e s&re 0e !opy them e1a!ty, 0itho&t any mista-es.
't 0o&% be m&!h simper if #ython ha% a b&it/in f&n!tion :et.s !a it
get_at_content= for !a!&ating 4$ !ontent. 'f that 0ere the !ase, then 0e !o&%
9&st r&n get_at_content in the same 0ay 0e r&n print, or len, or open.
4tho&gh, sa%y, #ython %oes not have s&!h a b&it/in f&n!tion, it %oes have the
ne1t best thing 5 a 0ay for &s to !reate o&r o0n f&n!tions.
Creating o&r o0n f&n!tion to !arry o&t a parti!&ar 9ob has many benefits. 't ao0s
&s to re/&se the same !o%e many times 0ithin a program 0itho&t having to !opy it
o&t ea!h time. 4%%itionay, if 0e fin% that 0e have to ma-e a !hange to the !o%e,
1 ,e !o&%, of !o&rse, !ompress this %o0n to a singe ine:
at_content @ +my_dna.count+)!), F my_dna.count+)"),, G len+my_dna,
b&t it 0o&% be m&!h ess rea%abe.
1
2
3
4
5
6
103 Chapter C: ,riting o&r o0n f&n!tions
0e ony have to %o it in one pa!e. "pitting o&r !o%e into f&n!tions aso ao0s &s
to ta!-e arger probems, as 0e !an 0or- on %ifferent bits of the !o%e
in%epen%enty. ,e !an aso re/&se !o%e a!ross m&tipe programs.
1efining a function
+et.s go ahea% an% !reate o&r get_at_content f&n!tion. (efore 0e start typing,
0e nee% to fig&re o&t 0hat the inp&ts :the n&mber an% types of the function
arguments= an% o&tp&ts :the type of the return value= are going to be. For this
f&n!tion, that seems pretty obvio&s 5 the inp&t is going to be a singe D;4
se<&en!e, an% the o&tp&t is going to be a %e!ima n&mber. $o transate these into
#ython terms: the f&n!tion 0i ta-e a singe arg&ment of type string, an% 0i
ret&rn a va&e of type number
1
. 7ere.s the !o%e:
def get/at/content(dna&:
length B len(dna&
a/count B dna.count(3!3&
t/count B dna.count(3+3&
at/content B (a/count C t/count& / length
return at/content
Reminder: if yo&.re &sing #ython 2 rather than #ython 3, in!&%e this ine at the
top of yo&r program:
from //future// import di0ision
$he first ine of the f&n!tion %efinition !ontains a severa %ifferent eements. ,e
start 0ith the 0or% def, 0hi!h is short for "efine :0riting a f&n!tion is !ae%
"efining it=. Foo0ing that 0e 0rite the name of the f&n!tion, foo0e% by the
names of the arg&ment variabes in parentheses. J&st i-e 0e sa0 before 0ith
1 'n fa!t, 0e !an be a itte bit more spe!ifi!: 0e !an say that the ret&rn va&e 0i be of type float 5 a
foating point n&mber :i.e. one 0ith a %e!ima point=.
10B Chapter C: ,riting o&r o0n f&n!tions
norma variabes, the f&n!tion name an% the arg&ment names are arbitrary 5 0e
!an &se 0hatever 0e i-e.
$he first ine en%s 0ith a !oon, 9&st i-e the first ine of the oops that 0e 0ere
oo-ing at in the previo&s !hapter. 4n% 9&st i-e oops, this ine is foo0e% by a
block of in%ente% ines 5 the function bo"y. $he f&n!tion bo%y !an have as many
ines of !o%e as 0e i-e, as ong as they a have the same in%entation. ,ithin the
f&n!tion bo%y, 0e !an refer to the arg&ments by &sing the variabe names from the
first ine. 'n this !ase, the variabe dna refers to the se<&en!e that 0as passe% in as
the arg&ment to the f&n!tion.
$he ast ine of the f&n!tion !a&ses it to ret&rn the 4$ !ontent that 0as !a!&ate%
in the f&n!tion bo%y. $o return from a f&n!tion, 0e simpy 0rite ret&rn foo0e%
by the va&e that the f&n!tion sho&% o&tp&t.
$here are a !o&pe of important things to be a0are of 0hen 0riting f&n!tions.
Firsty, 0e nee% to ma-e a !ear %istin!tion bet0een "efining a f&n!tion, an%
running it :0e refer to r&nning a f&n!tion as calling it=. $he !o%e 0e.ve 0ritten
above 0i not !a&se anything to happen 0hen 0e r&n it, be!a&se 0e.ve not a!t&ay
as-e% #ython to e1e!&te the get_at_content f&n!tion 5 0e have simpy %efine%
0hat it is. $he !o%e in the f&n!tion 0i not be e1e!&te% &nti 0e !a the f&n!tion
i-e this:
get/at/content(,!+(!)+((!))!,&
'f 0e simpy !a the f&n!tion i-e that, ho0ever, then the 4$ !ontent 0i vanish
on!e it.s been !a!&ate%. 'n or%er to &se the f&n!tion to %o something &sef&, 0e
m&st either store the res&t in a variabe:
at/content B get/at/content(,!+(!)+((!))!,&
?r &se it %ire!ty:
10C Chapter C: ,riting o&r o0n f&n!tions
print(,!+ content is , C str(get/at/content(,!+(!)+((!))!,&&&
"e!on%y, it.s important to &n%erstan% that the arg&ment variabe dna %oes not
ho% any parti!&ar va&e 0hen the f&n!tion is %efine%
1
. 'nstea%, its 9ob is to ho%
0hatever va&e is given as the arg&ment 0hen the f&n!tion is !ae%. 'n this 0ay it.s
anaogo&s to the oop variabes 0e sa0 in the previo&s !hapter: oop variabes ho%
a %ifferent va&e ea!h time ro&n% the oop, an% f&n!tion arg&ment variabes ho% a
%ifferent va&e ea!h time the f&n!tion is !ae%.
Finay, be a0are that the same s!oping r&es that appie% to oops aso appy to
f&n!tions 5 any variabes that 0e !reate as part of the f&n!tion ony e1ist insi%e the
f&n!tion, an% !annot be a!!esse% o&tsi%e. 'f 0e try to &se a variabe that.s !reate%
insi%e the f&n!tion from o&tsi%e:
def get/at/content(dna&:
length B len(dna&
a/count B dna.count(3!3&
t/count B dna.count(3+3&
at/content B (a/count C t/count& / length
return at/content
print(a/count&
,e. get an error:
<ame9rror: name 3a/count3 is not defined
1 'n%ee%, it %oesn.t a!t&ay e1ist 0hen it.s %efine%, ony 0hen it r&ns.
10F Chapter C: ,riting o&r o0n f&n!tions
4alling an" improving our function
+et.s 0rite a sma program that &ses o&r ne0 f&n!tion, to see ho0 it 0or-s. ,e.
try both storing the res&t in a variabe before printing it :ines I an% @= an%
printing it %ire!ty :ines 10 an% 11=:
def get/at/content(dna&:
length B len(dna&
a/count B dna.count(3!3&
t/count B dna.count(3+3&
at/content B (a/count C t/count& / length
return at/content
my/at/content B get/at/content(,!+()()(!+)(!+)(!!+)(,&
print(str(my/at/content&&
print(get/at/content(,!+()!+()!!)+(+!(),&&
print(get/at/content(,aactgtagctagctagcagcgta,&&
+oo-ing at the o&tp&t, 0e !an see that the first f&n!tion !a 0or-s fine 5 the 4$
!ontent is !a!&ate% to be 0.BC, is store% in the variabe myYatY!ontent, then
printe%. 7o0ever, the o&tp&t for the ne1t t0o !as is not so great. $he !a at ine
10 pro%&!es a n&mber 0ith 0ay too many fig&res after the %e!ima point, an% the
!a at ine 11, 0ith the inp&t se<&en!e in o0er !ase, gives a res&t of 0.0, 0hi!h is
%efinitey not !orre!t:
#.4$
#.$2J4""7%47#$GG24
#.#
,e. fi1 these probems by ma-ing a !o&pe of !hanges to the get_at_content
f&n!tion. ,e !an a%% a ro&n%ing step in or%er to imit the n&mber of signifi!ant
fig&res in the res&t. #ython has a b&it/in round f&n!tion that ta-es t0o
arg&ments 5 the n&mber 0e 0ant to ro&n%, an% the n&mber of signifi!ant fig&res.
,e. !a the round f&n!tion on the res&t before 0e ret&rn it. 4n% 0e !an fi1 the
1
2
3
4
5
6
7
8
9
10
11
10H Chapter C: ,riting o&r o0n f&n!tions
o0er !ase probem by !onverting the inp&t se<&en!e to &pper !ase before starting
the !a!&ation. 7ere.s the ne0 version of the f&n!tion, 0ith the same three
f&n!tion !as:
def get/at/content(dna&:
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
return round(at/content 2&
my/at/content B get/at/content(,!+()()(!+)(!+)(!!+)(,&
print(str(my/at/content&&
print(get/at/content(,!+()!+()!!)+(+!(),&&
print(get/at/content(,aactgtagctagctagcagcgta,&&
an% no0 the o&tp&t is 9&st as 0e 0ant:
#.4$
#.$3
#.$2
,e !an ma-e the f&n!tion even better tho&gh: 0hy not ao0 it to be !ae% 0ith
the n&mber of signifi!ant fig&res as an arg&ment
1
E 'n the above !o%e, 0e.ve pi!-e%
t0o signifi!ant fig&res, b&t there might be sit&ations 0here 0e 0ant to see more.
4%%ing the se!on% arg&ment is easyD 0e 9&st a%% it to the arg&ment variabe ist on
the first ine of the f&n!tion %efinition, an% then &se the ne0 arg&ment variabe in
the !a to round. ,e. thro0 in a fe0 !as to the ne0 version of the f&n!tion 0ith
%ifferent arg&ments to !he!- that it 0or-s:
1 4n even better so&tion 0o&% be to spe!ify the n&mber of signifi!ant fig&res in the string representation
of the n&mber 0hen it.s printe%.
10I Chapter C: ,riting o&r o0n f&n!tions
def get/at/content(dna sig/figs&:
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
return round(at/content sig/figs&
test/dna B ,!+()!+()!!)+(+!(),
print(get/at/content(test/dna "&&
print(get/at/content(test/dna 2&&
print(get/at/content(test/dna 3&&
$he o&tp&t !onfirms that the ro&n%ing 0or-s as inten%e%:
#.$
#.$3
#.$2J
ncapsulation with functions
+et.s pa&se for a moment an% !onsi%er 0hat happene% in the previo&s se!tion. ,e
0rote a f&n!tion, an% then 0rote some !o%e that &se% that f&n!tion. 'n the pro!ess
of 0riting the !o%e that &se% the f&n!tion, 0e %is!overe% a !o&pe of probems 0ith
o&r origina f&n!tion %efinition. $e were t,en able to go bac1 and c,ange t,e
function definition* wit,out ,aving to ma1e an+ c,anges to t,e code t,at
used t,e function.
'.ve 0ritten that ast senten!e in bo%, be!a&se it.s in!re%iby important. 't.s no
e1aggeration to say that &n%erstan%ing the impi!ations of that senten!e is the -ey
to being abe to 0rite arger, &sef& programs. $he reason it.s so important is that it
%es!ribes a programming phenomenon that 0e !a encapsulation. )n!aps&ation
9&st means %ivi%ing &p a !ompe1 program into itte bits 0hi!h 0e !an 0or- on
in%epen%enty. 'n the e1ampe above, the !o%e is %ivi%e% into t0o parts 5 the part
10@ Chapter C: ,riting o&r o0n f&n!tions
0here 0e %efine the f&n!tion, an% the part 0here 0e &se it 5 an% 0e !an ma-e
!hanges to one part 0itho&t 0orrying abo&t the effe!ts on the other part.
$his is a very po0erf& i%ea, be!a&se 0itho&t it, the siGe of programs 0e !an 0rite is
imite% to the n&mber of ines of !o%e 0e !an ho% in o&r hea% at one time. "ome of
the e1ampe !o%e in the so&tions to e1er!ises in the previo&s !hapter 0ere starting
to p&sh at this imit area%y, even for reativey simpe probems. (y !ontrast, &sing
f&n!tions ao0s &s to b&i% &p a !ompe1 program from sma b&i%ing bo!-s, ea!h
of 0hi!h in%ivi%&ay is sma eno&gh to &n%erstan% in its entirety.
(e!a&se &sing f&n!tions is so important, f&t&re so&tions to e1er!ises 0i &se them
0hen appropriate, even 0hen it.s not e1pi!ity mentione% in the probem te1t. '
en!o&rage yo& to get into the habit of &sing f&n!tions in yo&r so&tions too.
/unctions "on7t always have to take an argument
$here.s nothing in the r&es of #ython to say that yo&r f&n!tion must ta-e an
arg&ment. 't.s perfe!ty possibe to %efine a f&n!tion 0ith no arg&ments:
def get/a/number(&:
return 42
b&t s&!h f&n!tions ten% not to be very &sef&. For e1ampe, 0e !an 0rite a version
of get_at_content that %oesn.t re<&ire any arg&ments by setting the va&e of
the dna variabe insi%e the f&n!tion:
def get/at/content(&:
dna B ,!)+(!+()+!()+!,
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
return round(at/content 2&
110 Chapter C: ,riting o&r o0n f&n!tions
b&t that.s obvio&sy not very &sef&. ?!!asionay yo& may be tempte% to 0rite a
no/arg&ment f&n!tion that 0or-s i-e this:
def get/at/content(&:
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
return round(at/content 2&
dna B ,!)+(!+)(!+)(,
print(get/at/content(&&
4t first this seems i-e a goo% i%ea 5 it 0or-s be!a&se the f&n!tion gets the va&e of
the dna variabe that is set on ine I
1
. 7o0ever, this brea-s the en!aps&ation that
0e 0or-e% so har% to a!hieve. $he f&n!tion no0 ony 0or-s if there is a variabe
!ae% dna set in the bit of the !o%e 0here the f&n!tion is !ae%, so the t0o pie!es
of !o%e are no onger in%epen%ent.
'f yo& fin% yo&rsef 0riting !o%e i-e this, it.s &s&ay a goo% i%ea to i%entify 0hi!h
variabes from o&tsi%e the f&n!tion are being &se% insi%e it, an% t&rn them into
arg&ments.
1 't %oesn.t matter that the variabe is set after the f&n!tion is %efine% 5 a that matters it that it.s set before
the f&n!tion is !ae% on ine @.
1
2
3
4
5
6
7
8
9
111 Chapter C: ,riting o&r o0n f&n!tions
/unctions "on7t always have to return a value
Consi%er this variation of o&r f&n!tion 5 instea% of returning the 4$ !ontent, this
f&n!tion prints it to the s!reen:
def print/at/content(dna&:
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
print(str(round(at/content 2&&&
,hen yo& first start 0riting f&n!tions, it.s very tempting to %o this -in% of thing.
3o& thin- >>C0 5 nee" to calculate an" print the A' content < 57ll write a function that
"oes both>. $he tro&be 0ith this approa!h is that it res&ts in a f&n!tion that is ess
fe1ibe. 2ight no0 yo& 0ant to print the 4$ !ontent to the s!reen, b&t 0hat if yo&
ater %is!over that yo& 0ant to 0rite it to a fie, or &se it as part of some other
!a!&ationE 3o&. have to 0rite more f&n!tions to !arry o&t these tas-s.
$he -ey to %esigning fe1ibe f&n!tions is to re!ogniGe that the 9ob calculate an"
print the A' content is a!t&ay t0o separate 9obs 5 !a!&ating the 4$ !ontent, an%
printing it. $ry to 0rite yo&r f&n!tions in s&!h a 0ay that they 9&st %o one 9ob. 3o&
!an then easiy 0rite !o%e to !arry o&t more !ompi!ate% 9obs by &sing yo&r simpe
f&n!tions as b&i%ing bo!-s.
/unctions can be calle" with name" arguments
,hat %o 0e nee% to -no0 abo&t a f&n!tion in or%er to be abe to &se itE ,e nee% to
-no0 0hat the ret&rn va&e an% type is, an% 0e nee% to -no0 the n&mber an% type
of the arg&ments. For the e1ampes 0e.ve seen so far in this boo-, 0e aso nee% to
-no0 the order of the arg&ments. For instan!e, to &se the open f&n!tion 0e nee%
to -no0 that the name of the fie !omes first, foo0e% by the mo%e of the fie. 4n%
to &se o&r t0o/arg&ment version of get_at_content as %es!ribe% above, 0e nee%
112 Chapter C: ,riting o&r o0n f&n!tions
to -no0 that the D;4 se<&en!e !omes first, foo0e% by the n&mber of signifi!ant
fig&res.
$here.s a feat&re in #ython !ae% keywor" arguments 0hi!h ao0s &s to !a
f&n!tions in a sighty %ifferent 0ay. 'nstea% of giving a ist of arg&ments in
parentheses:
get/at/content(,!+)(+(!)+)(, 2&
0e !an s&ppy a ist of arg&ment variabe names an% va&es 9oine% by e<&as signs:
get/at/content(dnaB,!+)(+(!)+)(, sig/figsB2&
$his stye of !aing f&n!tions
1
has severa a%vantages. 't %oesn.t rey on the or%er
of arg&ments, so 0e !an &se 0hi!hever or%er 0e prefer. $hese t0o statements
behave i%enti!ay:
get/at/content(dnaB,!+)(+(!)+)(, sig/figsB2&
get/at/content(sig/figsB2 dnaB,!+)(+(!)+)(,&
't.s aso !earer to rea% 0hat.s happening 0hen the arg&ment names are given
e1pi!ity.
,e !an even mi1 an% mat!h the t0o styes of !aing 5 the foo0ing are a
i%enti!a:
get/at/content(,!+)(+(!)+)(, 2&
get/at/content(dnaB,!+)(+(!)+)(, sig/figsB2&
get/at/content(,!+)(+(!)+)(, sig/figsB2&
4tho&gh 0e.re not ao0e% to start off 0ith -ey0or% arg&ments then s0it!h ba!-
to norma 5 this 0i !a&se an error:
1 't 0or-s 0ith metho%s too, in!&%ing a the ones 0e.ve seen so far.
113 Chapter C: ,riting o&r o0n f&n!tions
get/at/content(dnaB,!+)(+(!)+)(, 2&
Key0or% arg&ments !an be parti!&ary &sef& for f&n!tions an% metho%s that have
a ot of arg&ments, an% 0e. &se them 0here appropriate in the e1ampes an%
e1er!ise so&tions in the rest of this boo-.
/unction arguments can have "efaults
,e.ve en!o&ntere% f&n!tion arg&ments 0ith %efa&ts before, 0hen 0e 0ere
%is!&ssing opening fies in !hapter 3. 2e!a that the open f&n!tion ta-es t0o
arg&ments 5 a fie name an% a mo%e string 5 b&t that if 0e !a it 0ith :ust a fie
name it &ses a %efa&t va&e for the mo%e string. ,e !an easiy ta-e a%vantage of
this feat&re in o&r o0n f&n!tions 5 0e simpy spe!ify the %efa&t va&e in the first
ine of the f&n!tion %efinition. 7ere.s a version of o&r get_at_content f&n!tion
0here the %efa&t n&mber of signifi!ant fig&res is t0o:
def get/at/content(dna sig/figsB2&:
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
return round(at/content sig/figs&
;o0 0e have the best of both 0or%s. 'f the f&n!tion is !ae% 0ith t0o arg&ments,
it 0i &se the n&mber of signifi!ant fig&res spe!ifie%D if it.s !ae% 0ith one
arg&ment, it 0i &se the %efa&t va&e of t0o signifi!ant fig&res. +et.s see some
e1ampes:
get/at/content(,!+)(+(!)+)(,&
get/at/content(,!+)(+(!)+)(, 3&
get/at/content(,!+)(+(!)+)(, sig/figsB4&
11B Chapter C: ,riting o&r o0n f&n!tions
$he f&n!tion ta-es !are of fiing in the %efa&t va&e for sig_figs for the first
f&n!tion !a 0here none is s&ppie%:
#.4$
#.4$$
#.4$4$
F&n!tion arg&ment %efa&ts ao0 &s to 0rite very fe1ibe f&n!tions 0hi!h !an
have varying n&mbers of arg&ments. 't ony ma-es sense to &se them for arg&ments
0here a sensibe %efa&t !an be !hosen 5 there.s no point spe!ifying a %efa&t for
the dna arg&ment in o&r e1ampe. $hey are parti!&ary &sef& for f&n!tions 0here
some of the options are ony going to be &se% infre<&enty.
'esting functions
,hen 0riting !o%e of any type, it.s important to perio%i!ay !he!- that yo&r !o%e
%oes 0hat yo& inten% it to %o. 'f yo& oo- ba!- over the so&tions to e1er!ises from
the first fe0 !hapters, yo& !an see that 0e generay test o&r !o%e at ea!h step by
printing some o&tp&t to the s!reen an% !he!-ing that it oo-s ?K. For e1ampe, in
!hapter 2 0hen 0e 0ere first !a!&ating 4$ !ontent, 0e &se% a very short test
se<&en!e to verify that o&r !o%e 0or-e% before r&nning it on the rea inp&t.
$he reason 0e &se% a test se<&en!e 0as that, be!a&se it 0as so short, 0e !o&%
easiy 0or- o&t the ans0er by eye an% !ompare it to the ans0er given by o&r !o%e.
$his i%ea 5 r&nning !o%e on a test inp&t an% !omparing the res&t to an ans0er
t,at we 1now to be correct
1
5 is s&!h a &sef& one that #ython has a b&it/in too
for e1pressing it: assert. 4n assertion !onsists of the 0or% assert, foo0e% by a
!a to o&r f&n!tion, then two e<&as signs, then the res&t that 0e e1pe!t
2
.
1 $hin- of it as simiar to r&nning a positive !ontro in a 0et/ab e1periment.
2 'n fa!t, assertions !an in!&%e any !on%itiona statementD 0e. earn abo&t those in the ne1t !hapter.
11C Chapter C: ,riting o&r o0n f&n!tions
For e1ampe, 0e -no0 that if 0e r&n o&r get_at_content f&n!tion on the D;4
se<&en!e >4$RC> 0e sho&% get an ans0er of 0.C. $his assertion 0i test 0hether
that.s the !ase:
assert get/at/content(,!+(),& BB #.$
;oti!e the t0o e<&as signs 5 0e. earn the reason behin% that in the ne1t !hapter.
$he 0ay that assertion statements 0or- is very simpeD if an assertion t&rns o&t to
be fase :i.e. if #ython e1e!&tes o&r f&n!tion on the inp&t >4$RC> an% the ans0er
isn.t 0.C= then the program 0i stop an% 0e 0i get an !ssertionError.
4ssertions are &sef& in a n&mber of 0ays. $hey provi%e a means for &s to !he!-
0hether o&r f&n!tions are 0or-ing as inten%e% an% therefore hep &s tra!- %o0n
errors in o&r programs. 'f 0e get some &ne1pe!te% o&tp&t from a program that &ses
a parti!&ar f&n!tion, an% the assertion tests for that f&n!tion a pass, then 0e !an
be !onfi%ent that the error %oesn.t ie in the f&n!tion b&t in the !o%e that !as it.
$hey aso et &s mo%ify a f&n!tion an% !he!- that 0e haven.t intro%&!e% any errors.
'f 0e have a f&n!tion that passes a series of assertion tests, an% 0e ma-e some
!hanges to it, 0e !an re/r&n the assertion tests an%, ass&ming they a pass, be
!onfi%ent that 0e haven.t bro-en the f&n!tion
1
.
4ssertions are aso &sef& as a form of %o!&mentation. (y in!&%ing a !oe!tion of
assertion tests aongsi%e a f&n!tion, 0e !an sho0 e1a!ty 0hat o&tp&t is e1pe!te%
from a given inp&t.
Finay, 0e !an &se assertions to test the behavio&r of o&r f&n!tion for &n&s&a
inp&ts. For e1ampe, 0hat is the e1pe!te% behavio&r of get_at_content 0hen
given a D;4 se<&en!e that in!&%es &n-no0n bases :&s&ay represente% as N=E 4
sensibe 0ay to han%e &n-no0n bases 0o&% be to e1!&%e them from the 4$
!ontent !a!&ation 5 in other 0or%s, the 4$ !ontent for a given se<&en!e sho&%n.t
1 $his i%ea is very simiar to a pro!ess in soft0are %eveopment !ae% regression testing.
11F Chapter C: ,riting o&r o0n f&n!tions
be affe!te% by a%%ing a b&n!h of &n-no0n bases. ,e !an 0rite an assertion that
e1presses this:
assert get/at/content(,!+()<<<<<<<<<<,& BB #.$
$his assertions fais for the !&rrent version of get_at_content. 7o0ever, 0e !an
easiy mo%ify the f&n!tion to remove a N !hara!ters before !arrying o&t the
!a!&ation:
def get/at/content(dna sig/figsB2&:
dna B dna.replace(3<3 33&
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
return round(at/content sig/figs&
an% no0 the assertion passes.
't.s !ommon to gro&p a !oe!tion of assertions for a parti!&ar f&n!tion together to
test for the !orre!t behavio&r on %ifferent types of inp&t. 7ere.s an e1ampe for
get_at_content 0hi!h sho0s a range of %ifferent types of behavio&r:
assert get/at/content(,!,& BB "
assert get/at/content(,(,& BB #
assert get/at/content(,!+(),& BB #.$
assert get/at/content(,!((,& BB #.33
assert get/at/content(,!((, "& BB #.3
assert get/at/content(,!((, $& BB #.33333
'n fa!t, this i%ea of gro&ping sets of tests together is s&!h a goo% one that 0e have
spe!ia 0or%s for it :test suites an% unit testing= an% there.s a b&it/in #ython too for
!arrying o&t s&!h tests. $a-e a oo- at the !hapter on mo%&es an% testing in
A"vance" Python for 8iologists for more info.
11H Chapter C: ,riting o&r o0n f&n!tions
(ecap
'n this !hapter, 0e.ve seen ho0 pa!-aging &p !o%e into f&n!tions heps &s to
manage the !ompe1ity of arge programs an% promote !o%e re&se. ,e earne% ho0
to %efine an% !a o&r o0n f&n!tions aong 0ith vario&s ne0 0ays to s&ppy
arg&ments to f&n!tions. ,e aso oo-e% at a !o&pe of things that are possibe in
#ython, b&t rarey a%visabe 5 0riting f&n!tions 0itho&t arg&ments or ret&rn
va&es. Finay, 0e e1pore% the &se of assertions to test o&r f&n!tions, an%
%is!&sse% ho0 0e !an &se them to !at!h errors before they be!ome a probem.
$his !hapter has !overe% the basi!s of 0riting an% &sing f&n!tions, b&t there.s
m&!h more 0e !an %o 0ith them 5 in fa!t, there.s a 0hoe stye of programming
:functional programming= 0hi!h revoves aro&n% the manip&ation of f&n!tions.
3o&. fin% a %is!&ssion of this in the !hapter in A"vance" Python for 8iologists
!ae%, &ns&rprisingy, functional programming.
$he remaining !hapters in this boo- 0i ma-e &se of f&n!tions in both the
e1ampes an% the e1er!ise so&tions, so ma-e s&re yo& are !omfortabe 0ith the
ne0 i%eas from this !hapter before moving on.
11I Chapter C: ,riting o&r o0n f&n!tions
!ercises
Percentage of amino acid residues* part one
,rite a f&n!tion that ta-es t0o arg&ments 5 a protein se<&en!e an% an amino a!i%
resi%&e !o%e 5 an% ret&rns the per!entage of the protein that the amino a!i% ma-es
&p. *se the foo0ing assertions to test yo&r f&n!tion:
assert my/function(,PSQS???Q7??7????PP?P, ,P,& BB $
assert my/function(,PSQS???Q7??7????PP?P, ,r,& BB "#
assert my/function(,PSQS???Q7??7????PP?P, ,?,& BB $#
assert my/function(,PSQS???Q7??7????PP?P, ,R,& BB #
Reminder: if yo&.re &sing #ython 2 rather than #ython 3, in!&%e this ine at the
top of yo&r program:
from //future// import di0ision
Percentage of amino acid residues* part two
Mo%ify the f&n!tion from part one so that it a!!epts a ist of amino a!i% resi%&es
rather than a singe one. 'f no ist is given, the f&n!tion sho&% ret&rn the
per!entage of hy%rophobi! amino a!i% resi%&es :4, ', +, M, F, ,, 3 an% S=. 3o&r
f&n!tion sho&% pass the foo0ing assertions:
assert my/function(,PSQS???Q7??7????PP?P, ',P,*& BB $
assert my/function(,PSQS???Q7??7????PP?P, '3P3 3?3*& BB $$
assert my/function(,PSQS???Q7??7????PP?P, '373 3S3 3?3*& BB 7#
assert my/function(,PSQS???Q7??7????PP?P,& BB %$
11@ Chapter C: ,riting o&r o0n f&n!tions
&olutions
Percentage of amino acid residues* part one
$his is a simiar probem to ones that 0e.ve ta!-e% before, b&t 0e. have to pay
attention to the %etais. +et.s start 0ith a pie!e of !o%e that %oes the !a!&ation for
a spe!ifi! protein se<&en!e an% amino a!i% !o%e, an% then t&rn it into a f&n!tion.
Ca!&ating the per!entage is very simiar to !a!&ating the 4$ !ontent, b&t 0e 0i
nee% to m&tipe the res&t by 100 to get a per!entage rather than a fra!tion:
protein B ,PSQS???Q7??7????PP?P,
aa B ,Q,
aa/count B protein.count(aa&
protein/length B len(protein&
percentage B aa/count K "## / protein/length
print(percentage&
;o0 0e. ma-e this !o%e into a f&n!tion by t&rning the t0o variabes protein an%
aa into arg&ments, an% ret&rning the per!entage rather than printing it. ,e. a%%
in the assertions at the en% of the program to test if the f&n!tion is %oing its 9ob:
def get/aa/percentage(protein aa&:
aa/count B protein.count(aa&
protein/length B len(protein&
percentage B aa/count K "## / protein/length
return percentage
4 test the function 2ith assertions
assert get/aa/percentage(,PSQS???Q7??7????PP?P, ,P,& BB $
assert get/aa/percentage(,PSQS???Q7??7????PP?P, ,r,& BB "#
assert get/aa/percentage(,msrslllrfllfllllpplp, ,?,& BB $#
assert get/aa/percentage(,PSQS???Q7??7????PP?P, ,R,& BB #
120 Chapter C: ,riting o&r o0n f&n!tions
2&nning the !o%e sho0s that one of the assertions is faiing 5 the error message
tes &s 0hi!h assertion is the faie% one:
assert get/aa/percentage(,PSQS???Q7??7????PP?P, ,r,& BB "#
!ssertion9rror
?&r f&n!tion fais to 0or- 0hen the protein se<&en!e is in &pper !ase, b&t the
amino a!i% resi%&e !o%e is in o0er !ase. +oo-ing at the assertions, 0e !an ma-e an
e%&!ate% g&ess that the ne1t one :0ith the protein in o0er !ase an% the amino
a!i% in &pper !ase= is probaby going to fai as 0e. +et.s try to fi1 both of these
probems by !onverting both the protein an% the amino a!i% string to &pper !ase at
the start of the f&n!tion. ,e. &se the same tri!- as 0e %i% before of !onverting a
string to &pper !ase an% then storing the res&t ba!- in the same variabe:
def get/aa/percentage(protein aa&:
4 con0ert both inputs to upper case
protein B protein.upper(&
aa B aa.upper(&
aa/count B protein.count(aa&
protein/length B len(protein&
percentage B aa/count K "## / protein/length
return percentage
;o0 a the assertions pass 0itho&t error.
Percentage of amino acid residues* part two
$his e1er!ise invoves something that 0e.ve not seen before: a f&n!tion that ta-es a
ist as one of its arg&ments. 4s in the previo&s e1er!ise, 0e. pi!- one of the
assertion !ases an% 0rite the !o%e to sove it first, then t&rn the !o%e into a
f&n!tion.
121 Chapter C: ,riting o&r o0n f&n!tions
$here are a!t&ay t0o 0ays to approa!h this probem. ,e !an &se a oop to go
thro&gh ea!h of the given amino a!i% resi%&es in t&rn, !o&nting &p the n&mber of
times they o!!&r in the protein se<&en!e, to get a tota !o&nt. ?r, 0e !an treat the
protein se<&en!e string as a ist :as %es!ribe% in the previo&s !hapter= an% as-, for
ea!h position, 0hether the !hara!ter at that position is a member of the ist of
amino a!i% resi%&es. ,e. &se the first metho% hereD in the ne1t !hapter 0e. earn
abo&t the toos ne!essary to impement the se!on%.
,e. nee% some 0ay to -eep a r&nning tota of mat!hing amino a!i%s as 0e go
ro&n% the oop, so 0e. !reate a ne0 variabe o&tsi%e the oop an% &p%ate it ea!h
time ro&n%. $he !o%e insi%e the oop 0i be <&ite simiar to that from the previo&s
e1er!ise. 7ere.s the !o%e 0ith some print statements so 0e !an see e1a!ty 0hat is
happening:
protein B ,PSQS???Q7??7????PP?P,
aa/list B '3P3 3?3 373*
4 the total 0ariable 2ill hold the total number of matching residues
total B #
for aa in aa/list:
print(,counting number of , C aa&
aa B aa.upper(&
aa/count B protein.count(aa&
4 add the number for this residue to the total count
total B total C aa/count
print(,running total is , C str(total&&
percentage B total K "## / len(protein&
print(,final percentage is , C str(percentage&&
,hen 0e r&n the !o%e, 0e !an see ho0 the r&nning tota in!reases ea!h time ro&n%
the oop:
122 Chapter C: ,riting o&r o0n f&n!tions
counting number of P
running total is "
counting number of ?
running total is ""
counting number of 7
running total is "3
final percentage is %$.#
;o0 et.s ta-e the !o%e an%, 9&st i-e before, t&rn the protein string an% the amino
a!i% ist into arg&ments to !reate a f&n!tion:
def get/aa/percentage(protein aa/list&:
protein B protein.upper(&
protein/length B len(protein&
total B #
for aa in aa/list:
aa B aa.upper(&
aa/count B protein.count(aa&
total B total C aa/count
percentage B total K "## / protein/length
return percentage
$his f&n!tion passes a the assertion tests e1!ept the ast one, 0hi!h tests the
behavio&r 0hen r&n 0ith ony one arg&ment. 'n fa!t, #ython never even gets as far
as testing the res&t from r&nning the f&n!tion, as 0e get an error in%i!ating that
the f&n!tion %i%n.t !ompete:
+ype9rror: get/aa/percentage(& ta:es exactly 2 arguments (" gi0en&
Fi1ing the error ta-es ony one !hange: 0e a%% a %efa&t va&e for aa_list in the
first ine of the f&n!tion %efinition:
123 Chapter C: ,riting o&r o0n f&n!tions
def get/aa/percentage(protein aa/listB'3!33L33?33P33733=33R33S3*&:
protein B protein.upper(&
protein/length B len(protein&
total B #
for aa in aa/list:
aa B aa.upper(&
aa/count B protein.count(aa&
total B total C aa/count
percentage B total K "## / protein/length
return percentage
an% no0 a the assertions pass.
12B Chapter F: Con%itiona tests
&: Conditional tests
Programs nee" to make "ecisions
'f 0e oo- ba!- at the e1ampes an% e1er!ises in previo&s !hapters, something that
stan%s o&t is the a!- of %e!ision/ma-ing. ,e.ve gone from %oing simpe
!a!&ations on in%ivi%&a bits of %ata to !arrying o&t more !ompi!ate% pro!e%&res
on !oe!tions of %ata, b&t the 0ay that ea!h bit of %ata :a se<&en!e, a base, a
spe!ies name, an e1on= has been treate% i%enti!ay.
2ea/ife probems, ho0ever, often re<&ire o&r programs to a!t as %e!ision/ma-ersD
to e1amine a property of some bit of %ata an% %e!i%e 0hat to %o 0ith it. 'n this
!hapter, 0e. see ho0 to %o that &sing con"itional statements. Con%itiona
statements are feat&res of #ython that ao0 &s to b&i% %e!ision points in o&r !o%e.
$hey ao0 o&r programs to %e!i%e 0hi!h o&t of a n&mber of possibe !o&rses of
a!tion to ta-e 5 instr&!tions i-e >print the name of the se;uence if it7s longer than
)33 bases> or >group two samples together if they were collecte" less than 13 metres
apart>.
(efore 0e !an start &sing !on%itiona statements, ho0ever, 0e nee% to &n%erstan%
con"itions.
4on"itions0 'rue an" /alse
4 con"ition is simpy a bit of !o%e that !an pro%&!e a tr&e or fase ans0er. $he
easiest 0ay to &n%erstan% ho0 !on%itions 0or- in #ython is try o&t a fe0 e1ampes.
$he foo0ing e1ampe prints o&t the res&t of testing :or evaluating= a b&n!h of
%ifferent !on%itions 5 some mathemati!a e1ampes, some &sing string metho%s,
an% one for testing if a va&e is in!&%e% in a ist:
12C Chapter F: Con%itiona tests
print(3 BB $&
print(3 - $&
print(3 ;B$&
print(len(,!+(),& - $&
print(,(!!++),.count(,+,& - "&
print(,!+()++,.starts2ith(,!+(,&&
print(,!+()++,.ends2ith(,+++,&&
print(,!+()++,.isupper(&&
print(,!+()++,.islo2er(&&
print(,S, in ',S, ,=, ,?,*&
'f 0e oo- at the o&tp&t, 0e !an see &se the ine n&mbers to mat!h &p ea!h
!on%ition 0ith its res&t:
7alse
7alse
+rue
7alse
+rue
+rue
7alse
+rue
7alse
+rue
(&t 0hat.s a!t&ay being printe% hereE 4t first gan!e, it oo-s i-e 0e.re printing
the strings >$r&e> an% >Fase>, b&t those strings %on.t appear any0here in o&r !o%e.
,hat is a!t&ay being printe% is the spe!ia b&it/in va&es that #ython &ses to
represent tr&e an% fase 5 they are !apitaiGe% so that 0e -no0 they.re these spe!ia
va&es.
,e !an sho0 that these va&es are spe!ia by trying to print them. $he foo0ing
!o%e r&ns 0itho&t errors :note the absen!e of <&otation mar-s=:
print(+rue&
print(7alse&
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
12F Chapter F: Con%itiona tests
0hereas trying to print arbitrary &n<&ote% 0or%s:
print(1ello&
!a&ses a NameError.
$here.s a 0i%e range of things that 0e !an in!&%e in !on%itions, an% it 0o&% be
impossibe to give an e1ha&stive ist here. $he basi! b&i%ing bo!-s are:
• e<&as :represente% by @@=
• greater an% ess than :represente% by E an% H=
• greater an% ess than or e<&a to :represente% by E@ an% H@=
• not e<&a :represente% byI@=
• is a va&e in a ist :represente% by in=
• are t0o ob9e!ts the same
1
:represente% by is=
Many %ata types aso provi%e metho%s that ret&rn $r&e or Fase va&es, 0hi!h are
often a ot more !onvenient to &se than the b&i%ing bo!-s above. ,e.ve area%y
seen a fe0 in the !o%e sampe above: for e1ampe, strings have a starts-ith
metho% that ret&rns tr&e if the string starts 0ith the string given as an arg&ment.
,e. mention these tr&e/fase metho%s 0hen they !ome &p.
;oti!e that the test for e<&aity is two e<&as signs, not one. Forgetting the se!on%
e<&as sign 0i !a&se an error.
;o0 that 0e -no0 ho0 to e1press tests as !on%itions, et.s see 0hat 0e !an %o 0ith
them.
1 4 %is!&ssion of 0hat this a!t&ay means in #ython is beyon% the s!ope of this boo-, so 0e. avoi% &sing
this !omparison for the !hapter.
12H Chapter F: Con%itiona tests
if statements
$he simpest -in% of !on%itiona statement is an if statement. 7opef&y the synta1
is fairy simpe to &n%erstan%:
expression/le0el B "2$
if expression/le0el - "##:
print(,gene is highly expressed,&
,e 0rite the 0or% if, foo0e% by a !on%ition, an% en% the first ine 0ith a !oon.
$here foo0s a bo!- of in%ente% ines of !o%e :the bo"y of the if statement=, 0hi!h
0i ony be e1e!&te% if the !on%ition is tr&e. $his !oon/p&s/bo!- pattern sho&%
be famiiar to yo& from the !hapters on oops an% f&n!tions.
Most of the time, 0e 0ant to &se an if statement to test a property of some variabe
0hose va&e 0e %on.t -no0 at the time 0hen 0e are 0riting the program. $he
e1ampe above is obvio&sy &seess, as the va&e of the expression_leCel
variabe is not going to !hange8
7ere.s a sighty more interesting e1ampe: 0e. %efine a ist of gene a!!ession
names an% print o&t 9&st the ones that start 0ith >a>:
accs B '3ab$%3 3bhG43 3h07%3 3ayJ33 3apJ73 3bd723*
for accession in accs:
if accession.starts2ith(3a3&:
print(accession&
+oo-ing at the o&tp&t ao0s &s to !he!- that this 0or-s as inten%e%:
ab$%
ayJ3
apJ7
12I Chapter F: Con%itiona tests
'f yo& ta-e a !ose oo- at the !o%e above, yo&. see something interesting 5 the
ines of !o%e insi%e the oop are in%ente% :9&st as 0e.ve seen before=, b&t the ine of
!o%e insi%e the if statement is in%ente% twice 5 on!e for the oop, an% on!e for
the if. $his is the first time 0e.ve seen m&tipe eves of in%entation, b&t it.s very
!ommon on!e 0e start 0or-ing 0ith arger programs 5 0henever 0e have one oop
or if statement neste% insi%e another, 0e. have this type of in%entation.
#ython is <&ite happy to have as many eves of in%entation as nee%e%, b&t yo&.
nee% to -eep !aref& tra!- of 0hi!h ines of !o%e beong at 0hi!h eve. 'f yo& fin%
yo&rsef 0riting a pie!e of !o%e that re<&ires more than three eves of in%entation,
it.s generay an in%i!ation that that pie!e of !o%e sho&% be t&rne% into a f&n!tion.
else statements
Cosey reate% to the if statement is the else statement. $he e1ampes above &se
a yes9no type of %e!ision/ma-ing: sho&% 0e print the gene a!!ession n&mber or
notE ?ften 0e nee% an either9or type of %e!ision, 0here 0e have t0o possibe
a!tions to ta-e. $o %o this, 0e !an a%% on an else !a&se after the en% of the bo%y
of an if statement:
expression/le0el B "2$
if expression/le0el - "##:
print(,gene is highly expressed,&
else:
print(,gene is lo2ly expressed,&
$he else statement %oesn.t have any !on%ition of its o0n 5 rather, the ese
statement bo%y is e1e!&te 0hen the if statement to 0hi!h it.s atta!he% is not
e1e!&te%.
7ere.s an e1ampe 0hi!h &ses if an% else to spit &p a ist of a!!ession names into
t0o %ifferent fies 5 a!!essions that start 0ith >a> go into the first fie, an% a other
a!!essions go into the se!on% fie:
12@ Chapter F: Con%itiona tests
file" B open(,one.txt, ,2,&
file2 B open(,t2o.txt, ,2,&
accs B '3ab$%3 3bhG43 3h07%3 3ayJ33 3apJ73 3bd723*
for accession in accs:
if accession.starts2ith(3a3&:
file".2rite(accession C ,.n,&
else:
file2.2rite(accession C ,.n,&
;oti!e ho0 there are m&tipe in%entation eves as before, b&t that the if an%
else statements are at the same eve.
elif statements
,hat if 0e have more than t0o possibe bran!hesE For e1ampe, say 0e 0ant three
fies of a!!ession names: ones that start 0ith >a>, ones that start 0ith >b>, an% a
others. ,e !o&% have a se!on% if statement neste% insi%e the else !a&se of the
first if statement:
file" B open(,one.txt, ,2,&
file2 B open(,t2o.txt, ,2,&
file3 B open(,three.txt, ,2,&
accs B '3ab$%3 3bhG43 3h07%3 3ayJ33 3apJ73 3bd723*
for accession in accs:
if accession.starts2ith(3a3&:
file".2rite(accession C ,.n,&
else:
if accession.starts2ith(3b3&:
file2.2rite(accession C ,.n,&
else:
file3.2rite(accession C ,.n,&
$his 0or-s, b&t is %iffi!&t to rea% 5 0e !an <&i!-y see that 0e nee% an e1tra eve
of in%entation for every a%%itiona !hoi!e 0e 0ant to in!&%e. $o get ro&n% this,
130 Chapter F: Con%itiona tests
#ython has an elif statement, 0hi!h merges together else an% if an% ao0s &s
to re0rite the above e1ampe in a m&!h more eegant 0ay:
file" B open(,one.txt, ,2,&
file2 B open(,t2o.txt, ,2,&
file3 B open(,three.txt, ,2,&
accs B '3ab$%3 3bhG43 3h07%3 3ayJ33 3apJ73 3bd723*
for accession in accs:
if accession.starts2ith(3a3&:
file".2rite(accession C ,.n,&
elif accession.starts2ith(3b3&:
file2.2rite(accession C ,.n,&
else:
file3.2rite(accession C ,.n,&
;oti!e ho0 this version of the !o%e ony nee%s t0o eves of in%ention. 'n fa!t,
&sing elif 0e !an have any n&mber of bran!hes an% sti ony re<&ire a singe
e1tra eve of in%entation:
for accession in accs:
if accession.starts2ith(3a3&:
file".2rite(accession C ,.n,&
elif accession.starts2ith(3b3&:
file2.2rite(accession C ,.n,&
elif accession.starts2ith(3c3&:
file3.2rite(accession C ,.n,&
elif accession.starts2ith(3d3&:
file4.2rite(accession C ,.n,&
elif accession.starts2ith(3e3&:
file$.2rite(accession C ,.n,&
else:
file%.2rite(accession C ,.n,&
131 Chapter F: Con%itiona tests
while loops
7ere.s one fina thing 0e !an %o 0ith !on%itions: &se them to %etermine 0hen to
e1it a oop. 'n !hapter B 0e earne% abo&t oops that iterate over a !oe!tion of
items :i-e a ist, a string or a fie=. #ython has another type of oop !ae% a -hile
oop. 2ather than r&nning a set n&mber of times, a -hile oop r&ns &nti some
!on%ition is met. For e1ampe, here.s a bit of !o%e that in!rements a !o&nt variabe
by one ea!h time ro&n% the oop, stopping 0hen the !o&nt variabe rea!hes ten:
count B #
2hile count;"#:
print(count&
count B count C "
(e!a&se norma oops in #ython are so po0erf&
1
, -hile oops are &se% m&!h ess
fre<&enty than in other ang&ages, so 0e 0on.t %is!&ss them f&rther.
8uil"ing up comple! con"itions
,hat if 0e 0ante% to e1press a !on%ition that 0as ma%e &p of severa partsE
'magine 0e 0ant to go thro&gh o&r ist of a!!essions an% print o&t ony the ones
that start 0ith >a> an% en% 0ith >3>. ,e !o&% &se t0o neste% if statements:
accs B '3ab$%3 3bhG43 3h07%3 3ayJ33 3apJ73 3bd723*
for accession in accs:
if accession.starts2ith(3a3&:
if accession.ends2ith(333&:
print(accession&
b&t this brings in an e1tra, &nnee%e% eve of in%ention. 4 better 0ay is to 9oin &p
the t0o !on%ition 0ith and to ma-e a !ompe1 e1pression:
1 ).g. the e1ampe !o%e here !o&% be better a!!ompishe% 0ith a range.
132 Chapter F: Con%itiona tests
accs B '3ab$%3 3bhG43 3h07%3 3ayJ33 3apJ73 3bd723*
for accession in accs:
if accession.starts2ith(3a3& and accession.ends2ith(333&:
print(accession&
$his version is ni!er in t0o 0ays: it %oesn.t re<&ire the e1tra eve of in%entation,
an% the !on%ition rea%s in a very nat&ra 0ay. ,e !an aso &se or to 9oin &p t0o
!on%itions, to pro%&!e a !ompe1 !on%ition that 0i be tr&e if either of the t0o
simpe !on%itions are tr&e:
accs B '3ab$%3 3bhG43 3h07%3 3ayJ33 3apJ73 3bd723*
for accession in accs:
if accession.starts2ith(3a3& or accession.starts2ith(3b3&:
print(accession&
,e !an even 9oin &p !ompe1 !on%itions to ma-e more !ompe1 !on%itions 5 here.s
an e1ampe 0hi!h prints a!!essions if they start 0ith either >a> or >b> an% en% 0ith
>B>:
accs B '3ab$%3 3bhG43 3h07%3 3ayJ33 3apJ73 3bd723*
for acc in accs:
if (acc.starts2ith(3a3& or acc.starts2ith(3b3&& and acc.ends2ith(343&:
print(acc&
;oti!e ho0 0e have to in!&%e parentheses in the above e1ampe to avoi%
ambig&ity. Finay, 0e !an negate any type of !on%ition by prefi1ing it 0ith the
0or% not. $his e1ampe 0i print o&t a!!essions that start 0ith >a> an% don.t en%
0ith F:
accs B '3ab$%3 3bhG43 3h07%3 3ayJ33 3apJ73 3bd723*
for acc in accs:
if acc.starts2ith(3a3& and not acc.ends2ith(3%3&:
print(acc&
133 Chapter F: Con%itiona tests
(y &sing a !ombination of and, or an% not :aong 0ith parentheses 0here
ne!essary= 0e !an b&i% &p arbitrariy !ompe1 !on%itions. $his -in% of &se for
!on%itions 5 i%entifying eements in a ist 5 !an often be %one better &sing either
the fiter f&n!tion, or a ist !omprehension. 3o&. fin% e1ampes of ea!h in the
!hapters on f&n!tiona programming an% !omprehensions respe!tivey in A"vance"
Python for 8iologists.
$hese three 0or%s are !oe!tivey -no0n as boolean operators an% !rop &p in a ot
of pa!es. For e1ampe, if yo& 0ante% to sear!h for information on &sing #ython in
bioogy, b&t %i%n.t 0ant to see pages that ta-e% abo&t bioogy of sna-es, yo& might
%o a sear!h for >biology python Bsnake>. $his is a!t&ay a !ompe1 !on%ition 9&st i-e
the ones above 5 Rooge a&tomati!ay a%%s an" bet0een 0or%s, an% &ses the
hyphen to mean not. "o yo&.re as-ing for pages that mention python an" bioogy
b&t not sna-es.
Writing true9false functions
"ometimes 0e 0ant to 0rite a f&n!tion that !an be &se% in a !on%ition. $his is very
easy to %o 5 0e 9&st ma-e s&re that o&r f&n!tion a0ays ret&rns either $r&e or Fase.
2emember that $r&e an% Fase are b&it/in va&es in #ython, so they !an be passe%
aro&n%, store% in variabes, an% ret&rne%, 9&st i-e n&mbers or strings.
7ere.s a f&n!tion that %etermines 0hether or not a D;4 se<&en!e is 4$/ri!h :0e.
say that a se<&en!e is 4$/ri!h if it has an 4$ !ontent of more than 0.FC=:
def is/at/rich(dna&:
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
if at/content - #.%$:
return +rue
else:
return 7alse
13B Chapter F: Con%itiona tests
,e. test this f&n!tion on a fe0 se<&en!es to see if it 0or-s:
print(is/at/rich(,!++!+)+!)+!,&&
print(is/at/rich(,)(()!()()+,&&
$he o&tp&t sho0s that the f&n!tion ret&rns $r&e or Fase 9&st i-e the other
!on%itions 0e.ve been oo-ing at:
+rue
7alse
$herefore 0e !an &se o&r f&n!tion in an if statement:
if is/at/rich(my/dna&:
4 do something 2ith the seAuence
(e!a&se the ast fo&r ines of o&r f&n!tion are %evote% to eva&ating a !on%ition
an% ret&rning $r&e or Fase, 0e !an 0rite a sighty more !ompa!t version. 'n this
e1ampe 0e eva&ate the !on%ition, an% then ret&rn the res&t right a0ay:
def is/at/rich(dna&:
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
return at/content - #.%$
$his is a itte more !on!ise, an% aso easier to rea% on!e yo&.re famiiar 0ith the
i%iom.
13C Chapter F: Con%itiona tests
(ecap
'n this short !hapter, 0e.ve %eat 0ith t0o things: !on%itions, an% the statements
that &se them.
,e.ve seen ho0 simpe !on%itions !an be 9oine% together to ma-e more !ompe1
ones, an% ho0 the !on!epts of tr&th an% fasehoo% are b&it in to #ython on a
f&n%amenta eve. ,e.ve aso seen ho0 0e !an in!orporate $r&e an% Fase in o&r
o0n f&n!tions in a 0ay that ao0s them to be &se% as part of !on%itions.
,e.ve been intro%&!e% to fo&r %ifferent toos that &se !on%itions 5 if, else, elif,
an% -hile 5 in appro1imate or%er of &sef&ness. 3o&. probaby fin%, in the
programs that yo& 0rite an% in yo&r so&tions to the e1er!ises in this boo-, that
yo& &se if an% else very fre<&enty, elif o!!asionay, an% -hile amost never.
13F Chapter F: Con%itiona tests
!ercises
'n the chapter?- fo%er in the e1er!ises %o0noa%, yo&. fin% a te1t fie !ae%
"ata=csv, !ontaining some ma%e/&p %ata for a n&mber of genes. )a!h ine !ontains
the foo0ing fie%s for a singe gene in this or%er: spe!ies name, se<&en!e, gene
name, e1pression eve. $he fie%s are separate% by !ommas :hen!e the name of the
fie 5 !sv stan%s for Comma "eparate% Sa&es=. $hin- of it as a representation of a
tabe in a sprea%sheet 5 ea!h ine is a ro0, an% ea!h fie% in a ine is a !o&mn. 4
the e1er!ises for this !hapter &se the %ata rea% from this fie.
Reminder: if yo&.re &sing #ython 2 rather than #ython 3, in!&%e this ine at the
top of yo&r programs:
from //future// import di0ision
3everal species
#rint o&t the gene names for a genes beonging to 1rosophila melanogaster or
1rosophila simulans.
"engt, range
#rint o&t the gene names for a genes bet0een @0 an% 110 bases ong.
5T content
#rint o&t the gene names for a genes 0hose 4$ !ontent is ess than 0.C an% 0hose
e1pression eve is greater than 200.
13H Chapter F: Con%itiona tests
Complex condition
#rint o&t the gene names for a genes 0hose name begins 0ith >-> or >h> e1!ept
those beonging to 1rosophila melanogaster.
8ig, low medium
For ea!h gene, print o&t a message giving the gene name an% saying 0hether its 4$
!ontent is high :greater than 0.FC=, o0 :ess than 0.BC= or me%i&m :bet0een 0.BC
an% 0.FC=.
13I Chapter F: Con%itiona tests
&olutions
3everal species
$hese e1er!ises are some0hat more !ompi!ate% than previo&s ones, an% they.re
going to re<&ire materia from m&tipe %ifferent !hapters to sove. $he first
probem is to %ea 0ith the format of the %ata fie. ?pen it &p in a te1t e%itor an%
ta-e a oo- before !ontin&ing.
,e -no0 that 0e.re going to have to open the fie :!hapter 3= an% pro!ess the
!ontents ine/by/ine :!hapter B=. $o %ea 0ith ea!h ine, 0e. have to spit it to
ma-e a ist of !o&mns :!hapter B=, then appy the !on%ition :this !hapter= in or%er
to fig&re o&t 0hether or not 0e sho&% print it. 7ere.s a program that 0i rea% ea!h
ine from the fie, spit it &sing !ommas as the %eimiter, then assign ea!h of the
fo&r !o&mns to a variabe an% print the gene name:
data B open(,data.cs0,&
for line in data:
columns B line.rstrip(,.n,&.split(,,&
species B columns'#*
seAuence B columns'"*
name B columns'2*
expression B columns'3*
print(name&
;oti!e that 0e &se rstrip to remove the ne0ine from the en% of the !&rrent ine
before spitting it. ,e -no0 the or%er of the fie%s in the ine be!a&se they 0ere
mentione% in the e1er!ise %es!ription, so 0e !an easiy assign them to the fo&r
variabes. $his program %oesn.t %o anything &sef&, b&t 0e !an !he!- the o&tp&t to
!onfirm that it gets the names right:
13@ Chapter F: Con%itiona tests
:dy%47
Ddg7%%
:dy$33
hdt73J
hdu#4$
teg43%
;o0 0e !an a%% in the !on%ition. ,e 0ant to print the name if the spe!ies is eit,er
1rosophila melanogaster or 1rosophila simulans. 'f the spe!ies name is neit,er of
those t0o, then 0e %on.t 0ant to %o anything. $his is a yes9no type %e!ision, so 0e
nee% an if statement:
data B open(,data.cs0,&
for line in data:
columns B line.rstrip(,.n,&.split(,,&
species B columns'#*
seAuence B columns'"*
name B columns'2*
expression B columns'3*
if species BB ,@rosophila melanogaster, or species BB ,@rosophila
simulans,:
print(name&
$he ine !ontaining the if statement is <&ite ong, so it 0raps aro&n% onto the ne1t
ine on this page, b&t it.s sti 9&st a singe ine in the program fie. ,e !an !he!- the
o&tp&t 0e get:
:dy%47
Ddg7%%
:dy$33
against the !ontents of the fie, an% !onfirm that the program is 0or-ing.
1B0 Chapter F: Con%itiona tests
"engt, range
,e !an re/&se a arge part of the !o%e from the previo&s e1er!ise to hep sove this
one. ,e have another !ompe1 !on%ition: 0e ony 0ant to print names for genes
0hose ength is bet0een @0 an% 110 bases 5 in other 0or%s, genes 0hose ength is
greater than @0 and ess than 110. ,e. have to !a!&ate the ength &sing the len
f&n!tion. ?n!e 0e.ve %one that the rest of the program is <&ite straightfor0ar%:
data B open(,data.cs0,&
for line in data:
columns B line.rstrip(,.n,&.split(,,&
species B columns'#*
seAuence B columns'"*
name B columns'2*
expression B columns'3*
if len(seAuence& - J# and len(seAuence& ; ""#:
print(name&
5T content
$his e1er!ise has a !ompe1 !on%ition i-e the others, b&t it aso re<&ires &s to %o a
bit more !a!&ation 5 0e nee% to be abe to !a!&ate the 4$ !ontent of ea!h
se<&en!e. 2ather than starting from s!rat!h, 0e. simpy &se the f&n!tion that 0e
0rote in the previo&s !hapter an% in!&%e it at the start of the program. ?n!e 0e.ve
%one that, it.s 9&st a !ase of &sing the o&tp&t from get_at_content as part of
the !on%ition. ,e m&st be !aref& to !onvert the fo&rth !o&mn 5 the e1pression
eve 5 into an integer so that it !an be propery !ompare%:
1B1 Chapter F: Con%itiona tests
4 our function to get !+ content
def get/at/content(dna&:
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
return at/content

data B open(,data.cs0,&
for line in data:
columns B line.rstrip(,.n,&.split(,,&
species B columns'#*
seAuence B columns'"*
name B columns'2*
expression B int(columns'3*&
if get/at/content(seAuence& ; #.$ and expression - 2##:
print(name&
Complex condition
$here are no !a!&ations to !arry o&t for this e1er!ise 5 the !ompe1ity !omes from
the fa!t that there are three !omponents to the !on%ition, an% they have to be
9oine% together in the right 0ay:
data B open(,data.cs0,&
for line in data:
columns B line.rstrip(,.n,&.split(,,&
species B columns'#*
seAuence B columns'"*
name B columns'2*
expression B columns'3*
if (name.starts2ith(3:3& or name.starts2ith(3h3&& and species 5B
,@rosophila melanogaster,:
print(name&
$he ine !ontaining the if statement is <&ite ong, so it 0raps aro&n% onto the ne1t
ine on this page, b&t it.s sti 9&st a singe ine in the program fie. $here are t0o
1B2 Chapter F: Con%itiona tests
%ifferent 0ays to e1press the re<&irement that the name is not 1rosophila
melanogaster= 'n the above e1ampe 0e.ve &se% the not/e<&as sign :8Z= b&t 0e
!o&% aso have &se% the not booean operator:
if (name.starts2ith(3:3& or name.starts2ith(3h3&& and not species BB
,@rosophila melanogaster,:
8ig, low medium
;o0 0e !ome to an e1er!ise that re<&ires the &se of m&tipe bran!hes. ,e have
three %ifferent printing options for ea!h gene 5 high, o0 an% me%i&m 5 so 0e.
nee% an if..elif..else se!tion to han%e the !on%itions. ,e. &se the
get_at_content f&n!tion as before:
4 our function to get !+ content
def get/at/content(dna&:
length B len(dna&
a/count B dna.upper(&.count(3!3&
t/count B dna.upper(&.count(3+3&
at/content B (a/count C t/count& / length
return at/content

data B open(,data.cs0,&
for line in data:
columns B line.rstrip(,.n,&.split(,,&
species B columns'#*
seAuence B columns'"*
name B columns'2*
expression B columns'3*
if get/at/content(seAuence& - #.%$:
print(name C , has high !+ content,&
elif get/at/content(seAuence& ; #.4$:
print(name C , has lo2 !+ content,&
else:
print(name C , has medium !+ content,&
1B3 Chapter F: Con%itiona tests
Che!-ing the o&tp&t !onfirms that the !on%itions are 0or-ing:
:dy%47 has high !+ content
Ddg7%% has medium !+ content
:dy$33 has medium !+ content
hdt73J has lo2 !+ content
hdu#4$ has medium !+ content
teg43% has medium !+ content
$his genera type of probem is very !ommon in programming. $here.s a simiar
e1er!ise at the en% of the !hapter on f&n!tiona programming in A"vance" Python
for 8iologists 0hi!h i&strates a %ifferent approa!h to soving them.
1BB Chapter H: 2eg&ar e1pressions
#: Regular expressions
'he importance of patterns in biology
4 ot of 0hat 0e %o 0hen 0riting programs for bioogy !an be %es!ribe% as
sear!hing for patterns in strings. $he obvio&s e1ampes !ome from the anaysis of
bioogi!a se<&en!e %ata 5 remember that D;4, 2;4 an% protein se<&en!es are
9&st strings. Many of the things 0e 0ant to oo- for in bioogi!a se<&en!es !an be
%es!ribe% in terms of patterns:
• protein %omains
• D;4 trans!ription fa!tor bin%ing motifs
• restri!tion enGyme !&t sites
• %egenerate #C2 primer sites
• r&ns of monon&!eoti%es
7o0ever, it.s not 9&st se<&en!e %ata that !an have interesting patterns. 4s 0e
%is!&sse% in !hapter 3, most of the other types of %ata 0e have to %ea 0ith in
bioogy !omes in the form of strings
1
insi%e te1t fies 5 things i-e:
• rea% mapping o!ations
• geographi!a sampe !oor%inates
• ta1onomi! names
• gene names
• gene a!!ession n&mbers
• (+4"$ sear!hes
1 ;ote that atho&gh many of the things in this ist are n&meri!a %ata, they.re sti rea% in to #ython
programs as strings an% nee% to be manip&ate% as s&!h.
1BC Chapter H: 2eg&ar e1pressions
'n previo&s !hapters, 0e.ve oo-e% at some programming tas-s that invove pattern
re!ognition in strings. ,e.ve seen ho0 to !o&nt in%ivi%&a amino a!i% resi%&es :an%
even gro&ps of amino a!i% resi%&es= in protein se<&en!es :!hapter C=, an% ho0 to
i%entify restri!tion enGyme !&t sites in D;4 se<&en!es :!hapter 2=. ,e.ve aso seen
ho0 to e1amine parts of gene names an% mat!h them against in%ivi%&a !hara!ters
:!hapter F=.
$he !ommon theme among a these probems is that they invove sear!hing for a
fixed set of !hara!ters. (&t there are many probems that 0e 0ant to sove that
re<&ire more fe1ibe patterns. For e1ampe:
• Riven a D;4 se<&en!e, 0hat.s the ength of the poy/4 taiE
• Riven a gene a!!ession name, e1tra!t the part bet0een the thir% !hara!ter
an% the &n%ers!ore
• Riven a protein se<&en!e, %etermine if it !ontains this highy/re%&n%ant
%omain motif
(e!a&se these types of probems !rop &p in so many %ifferent fie%s, there.s a
stan%ar% set of toos in #ython
1
for %eaing 0ith them: regular e!pressions= 2eg&ar
e1pressions
2
are a topi! that might not be !overe% in a genera/p&rpose
programming boo-, b&t be!a&se they.re so &sef& in bioogy, 0e.re going to %evote
the 0hoe of this !hapter to oo-ing at them.
4tho&gh the toos for %eaing 0ith reg&ar e1pressions are b&it in to #ython, they
are not ma%e a&tomati!ay avaiabe 0hen yo& 0rite a program. 'n or%er to &se
them 0e m&st first ta- abo&t mo%&es.
1 4n% in many other ang&ages an% &tiities.
2 $he name is often abbreviate% to rege!.
1BF Chapter H: 2eg&ar e1pressions
2o"ules in Python
$he f&n!tions an% %ata types that 0e.ve %is!&sse% so far in this boo- have been
ones that are i-ey to be nee%e% in pretty m&!h every program 5 toos for %eaing
0ith strings an% n&mbers, for rea%ing an% 0riting fies, an% for manip&ating ists
of %ata. 4s s&!h, they are a&tomati!ay ma%e avaiabe 0hen 0e start to !reate a
#ython program. 'f 0e 0ant to open a fie, 0e simpy 0rite a statement that &ses
the open f&n!tion.
7o0ever, there.s another !ategory of toos in #ython 0hi!h are more spe!iaiGe%.
2eg&ar e1pressions are one e1ampe, b&t there is a arge ist of spe!iaiGe% toos
0hi!h are very &sef& 0hen yo& nee% them
1
, b&t are not i-ey to be nee%e% for the
ma9ority of programs. )1ampes in!&%e toos for %oing a%van!e% mathemati!a
!a!&ations, for %o0noa%ing %ata from the 0eb, for r&nning e1terna programs,
an% for manip&ating %ate/time information. )a!h !oe!tion of spe!iaiGe% toos 5
reay 9&st a !oe!tion of spe!iaiGe% functions an% %ata types 5 is !ae% a mo"ule.
For reasons of effi!ien!y, #ython %oesn.t a&tomati!ay ma-e these mo%&es
avaiabe in ea!h ne0 program, as it %oes 0ith the more basi! toos. 'nstea%, 0e
have to e1pi!ity oa% ea!h mo%&e of spe!iaiGe% toos that 0e 0ant to &se insi%e
o&r program. $o oa% a mo%&e 0e &se the import statement
2
. For e1ampe, the
mo%&e that %eas 0ith reg&ar e1pressions is !ae% re, so if 0e 0ant to 0rite a
program that &ses the reg&ar e1pression toos 0e m&st in!&%e the ine:
import re
at the top of o&r program. ,hen 0e then 0ant to &se one of the toos from a
mo%&e, 0e have to prefi1 it 0ith the mo%&e name
3
. For e1ampe, to &se the
1 'n%ee%, this is one of the great strengths of the #ython ang&age.
2 $his is the reason for the from __future__ import diCision statement that 0e have to in!&%e if
0e.re &sing #ython 2.
3 $here are 0ays ro&n% this, b&t 0e 0on.t !onsi%er them in this boo-.
1BH Chapter H: 2eg&ar e1pressions
reg&ar e1pression search f&n!tion :0hi!h 0e. %is!&ss ater in this !hapter= 0e
have to 0rite:
re.search(pattern string&
rather than simpy:
search(pattern string&
'f 0e forget to import the mo%&e 0hi!h 0e 0ant to &se, or forget to in!&%e the
mo%&e name as part of the f&n!tion !a, 0e 0i get a NameError.
,e. en!o&nter vario&s other mo%&e in the rest of this boo-. For the rest of this
!hapter spe!ifi!ay, a !o%e e1ampes 0i re<&ire the import re statement in
or%er to 0or-. For !arity, 0e 0on.t in!&%e it, so if yo& 0ant try r&nning any of the
!o%e in this !hapter, yo&. nee% to a%% it at the top.
For ots more on mo%&es, in!&%ing ho0 to !reate yo&r o0n, ta-e a oo- at the
mo%&es !hapter in A"vance" Python for 8iologists.
(aw strings
,riting reg&ar e1pression patterns, as 0e. see in the very ne1t se!tion of this
!hapter, re<&ires &s to type a ot of spe!ia !hara!ters. 2e!a from !hapter 2 that
!ertain !ombinations of !hara!ters are interprete% by #ython to have spe!ia
meaning. For e1ampe, \n means start a new line, an% \t means insert a tab
character.
*nfort&natey, there are a imite% n&mber of spe!ia !hara!ters to go ro&n%, so
some of the !hara!ters that have a spe!ia meaning in reg&ar e1pressions !ash
0ith the !hara!ters that alread+ have a spe!ia meaning. #ython.s 0ay ro&n% this
probem is to have a spe!ia r&e for strings: if 0e p&t the etter r imme%iatey
1BI Chapter H: 2eg&ar e1pressions
before the opening <&otation mar-, then any spe!ia !hara!ters insi%e the string
are ignore%:
print(r,.t.n,&
$he r stan%s for raw, 0hi!h is #ython.s %es!ription for a string 0here spe!ia
!hara!ters are ignore%. ;oti!e that the r goes outside the <&otation mar-s 5 it is
not part of the string itsef. ,e !an see from the o&tp&t that the above !o%e prints
o&t the string 9&st as 0e.ve 0ritten it:
.t.n
0itho&t any tabs or ne0 ines. 3o&. see this spe!ia raw notation &se% in a the
reg&ar e1pression !o%e e1ampes in this !hapter.
&earching for a pattern in a string
,e. start off 0ith the simpest reg&ar e1pression too. re.search is a tr&e/fase
f&n!tion that %etermines 0hether or not a pattern appears some0here in a string.
't ta-es t0o arg&ments, both strings. $he first arg&ment is the pattern that yo&
0ant to sear!h for, an% the se!on% arg&ment is the string that yo& 0ant to sear!h
in. For e1ampe, here.s ho0 0e test if a D;4 se<&en!e !ontains an )!o2' restri!tion
site:
dna B ,!+)()(!!++)!),
if re.search(r,(!!++), dna&:
print(,restriction site found5,&
;oti!e that 0e.ve &se% the ra0 notation for the pattern, even tho&gh it.s not stri!ty
ne!essary as it %oesn.t !ontain any spe!ia !hara!ters.
1B@ Chapter H: 2eg&ar e1pressions
5lternation
$he above e1ampe isn.t parti!&ary interesting, as the restri!tion motif has no
variation. +et.s try it 0ith the 4va'' motif, 0hi!h !&ts at t0o %ifferent motifs:
RR4CC an% RR$CC. ,e !an &se the te!hni<&es 0e earne% in the previo&s !hapter
to ma-e a !ompe1 !on%ition &sing or:
dna B ,!+)()(!!++)!),
if re.search(r,((!)), dna& or re.search(r,((+)), dna&:
print(,restriction site found5,&
(&t a better 0ay is to !apt&re the variation in the 4va'' site &sing a reg&ar
e1pression:
dna B ,!+)()(!!++)!),
if re.search(r,(((!T+&)), dna&:
print(,restriction site found5,&
7ere 0e.re &sing the aternation feat&re of reg&ar e1pressions. 'nsi%e parentheses,
0e 0rite the aternatives separate% by a pipe !hara!ter, so +!J", means either A or
'. $his ets &s 0rite a singe pattern 5 ##+!J",CC 5 0hi!h !apt&res the variation
in the motif.
C,aracter groups
$he (is' restri!tion enGyme !&ts at an even 0i%er range of motifs 5 the pattern is
RC;RC, 0here ; represents any base. ,e !an &se the same aternation te!hni<&e
to sear!h for this pattern:
dna B ,!+)()(!!++)!),
if re.search(r,()(!T+T(T)&(), dna&:
print(,restriction site found5,&
1C0 Chapter H: 2eg&ar e1pressions
7o0ever, there.s another reg&ar e1pression feat&re that ets &s 0rite the pattern
more !on!isey. 4 pair of s<&are bra!-ets 0ith a ist of !hara!ters insi%e them !an
represent any one of these !hara!ters. "o the pattern #C.!"#C2#C 0i mat!h
#C!#C, #C"#C, #C##C an% #CC#C. 7ere.s the same program &sing !hara!ter
gro&ps:
dna B ,!+)()(!!++)!),
if re.search(r,()'!+()*(), dna&:
print(,restriction site found5,&
'f 0e 0ant a !hara!ter in a pattern to mat!h an+ !hara!ter in the inp&t, 0e !an &se
a perio% 5 the pattern #C.#C 0o&% mat!h a fo&r possibiities. 7o0ever, the
perio% 0o&% aso mat!h any !hara!ter 0hi!h is not a D;4 base, or even a etter.
$herefore, the 0hoe pattern 0o&% aso mat!h #C7#C, #CK#C an% #C?#C, 0hi!h
may not be 0hat 0e 0ant.
"ometimes it.s easier, rather than isting a the a!!eptabe !hara!ters, to spe!ify
the !hara!ters that 0e don.t 0ant to mat!h. #&tting a !aret U at the start of a
!hara!ter gro&p i-e this MUJ3XN 0i negate it, an% mat!h any !hara!ter that isn.t in
the gro&p.
;uantifiers
$he reg&ar e1pression feat&res %is!&sse% above et &s %es!ribe variation in the
in%ivi%&a !hara!ters of patterns. 4nother gro&p of feat&res, ;uantifiers, et &s
%es!ribe variation in the n&mber of times a se!tion of a pattern is repeate%.
4 <&estion mar- imme%iatey foo0ing a !hara!ter means that that !hara!ter is
optiona 5 it !an mat!h either <ero or one times. "o in the pattern #!"LC the " is
optiona, an% the pattern 0i mat!h either #!"C or #!C. 'f 0e 0ant to appy a
<&estion mar- to more than one !hara!ter, 0e !an gro&p the !hara!ters in
1C1 Chapter H: 2eg&ar e1pressions
parentheses. For e1ampe, in the pattern RRR:444=E$$$ the gro&p of three !s is
optiona, so the pattern 0i mat!h either ###!!!""" or ###""".
4 p&s sign imme%iatey foo0ing a !hara!ter or gro&p means that the !hara!ter or
gro&p must be present b&t !an be repeate% any n&mber of times 5 in other 0or%s,
it 0i mat!h one or more times. For e1ampe, the pattern ###!F""" 0i mat!h
three #s, foo0e% by one or more !s, foo0e% by three "s. "o it 0i mat!h
###!""", ###!!"", ###!!!"", et!. b&t not ###""".
4n asteris- imme%iatey foo0ing a !hara!ter or gro&p means that the !hara!ter or
gro&p is optiona, b&t !an aso be repeate%. 'n other 0or%s, it 0i mat!h <ero or
more times. For e1ampe, the pattern ###!4""" 0i mat!h three #s, foo0e% by
Gero or more !s, foo0e% by three "s. "o it 0i mat!h ###""", ###!""",
###!!""", et!.
'f 0e 0ant to spe!ify a spe!ifi! n&mber of repeats, 0e !an &se !&ry bra!-ets.
Foo0ing a !hara!ter or gro&p 0ith a single n&mber insi%e !&ry bra!-ets 0i
mat!h e1a!ty that n&mber of repeats. For e1ampe, the pattern #!M1N" 0i mat!h
#!!!!!" b&t not #!!!!" or #!!!!!!". Foo0ing a !hara!ter or gro&p 0ith a
pair of numbers insi%e !&ry bra!-ets separate% 0ith a !omma ao0s &s to spe!ify
an a!!eptabe range of n&mber of repeats. For e1ampe, the pattern #!M6%8N" 0i
mat!h #!!", #!!!" an% #!!!!" b&t not #!" or #!!!!!".
Positions
$he fina set of reg&ar e1pression toos 0e.re going to oo- at %on.t represent
!hara!ters at a, b&t rather positions in the inp&t string. $he !aret symbo O
mat!hes the start of a string, an% the %oar symbo P mat!hes the end of a string.
$he pattern O!!! 0i mat!h !!!""" b&t not ###!!!""". $he pattern ###P 0i
mat!h !!!### b&t not !!!###CCC.
1C2 Chapter H: 2eg&ar e1pressions
Combining
$he rea po0er of reg&ar e1pressions !omes from !ombining these toos. ,e !an
&se <&antifiers together 0ith aternations an% !hara!ter gro&ps to spe!ify very
fe1ibe patterns. For e1ampe, here.s a !ompe1 pattern to i%entify f&/ength
e&-aryoti! messenger 2;4 se<&en!es:
8!+('!+()*U3#"###V!U$"#V6
2ea%ing the pattern from eft to right, it 0i mat!h:
• an 4$R start !o%on at the beginning of the se<&en!e
• foo0e% by bet0een 30 an% 1000 bases 0hi!h !an be 4, $, R or C
• foo0e% by a poy/4 tai of bet0een C an% 10 bases at the en% of the
se<&en!e
4s yo& !an see, reg&ar e1pressions !an be <&ite tri!-y to rea% &nti yo&.re famiiar
0ith them8 7o0ever, it.s 0e 0orth investing a bit of time earning to &se them, as
the same notation is &se% a!ross m&tipe %ifferent toos. $he reg&ar e1pression
s-is that yo& earn in #ython are transferabe to other programming ang&ages,
!omman% ine toos, an% te1t e%itors.
$he feat&res 0e.ve %is!&sse% above are the ones most &sef& in bioogy, an% are
s&ffi!ient to ta!-e a the e1er!ises at the en% of the !hapter. 7o0ever, there are
many more reg&ar e1pression feat&res avaiabe in #ython. 'f yo& 0ant to be!ome
a reg&ar e1pression master, it.s 0orth rea%ing &p on gree"y vs= minimal ;uantifiers,
backBreferences, lookahea" an% lookbehin" assertions, an% builtBin character classes.
(efore 0e move on to oo- at some more sophisti!ate% &ses of reg&ar e1pressions,
it.s 0orth noting that there.s a metho% simiar to re.search !ae% re.match.
$he %ifferen!e is that re.search 0i i%entify a pattern o!!&rring an+w,ere in
the string, 0hereas re.match 0i ony i%entify a pattern if it mat!hes the entire
string. Most of the time 0e 0ant the former behavio&r.
1C3 Chapter H: 2eg&ar e1pressions
!tracting the part of the string that matche"
'n the se!tion above 0e &se% re.search as the !on%ition in an if statement to
%e!i%e 0hether or not a string !ontaine% a pattern. ?ften in o&r programs, 0e 0ant
to fin% o&t not ony if a pattern mat!he%, b&t w,at part of the string 0as mat!he%.
$o %o this, 0e nee% to store the res&t of &sing re.search, then &se the group
metho% on the res&ting ob9e!t.
,hen intro%&!ing the re.search f&n!tion above ' 0rote that it 0as a tr&e/fase
f&n!tion. $hat.s not e!actly !orre!t tho&gh 5 if it fin%s a mat!h, it %oesn.t ret&rn
"rue, b&t rather an ob9e!t that is eva&ate% as tr&e in a !on%itiona !onte1t
1
:if the
%istin!tion %oesn.t seem important to yo&, then yo& !an safey ignore it=. $he va&e
that.s a!t&ay ret&rne% is a mat!h ob9e!t 5 a ne0 %ata type that 0e.ve not
en!o&ntere% before. +i-e a fie ob9e!t :see !hapter 3=, a mat!h ob9e!t %oesn.t
represent a simpe thing, i-e a n&mber or string. 'nstea%, it represents the res&ts
of a reg&ar e1pression sear!h. 4n% again, 9&st i-e a fie ob9e!t, a mat!h ob9e!t has a
n&mber of &sef& metho%s for getting %ata o&t of it.
?ne s&!h metho% is the group metho%. 'f 0e !a this metho% on the res&t of a
reg&ar e1pression sear!h, 0e get the portion of the inp&t string that mat!he% the
pattern:
dna B ,!+(!)(+!)(+!)(!)+(,
4 store the match obDect in the 0ariable m
m B re.search(r,(!'!+()*U3V!), dna&
print(m.group(&&
'n the above !o%e, 0e.re sear!hing insi%e a D;4 se<&en!e for #!, foo0e% by three
bases, foo0e% by !C. (y !aing the group metho% on the res&ting mat!h ob9e!t,
1 'f a mat!h isn.t fo&n%, then the same thing appiesD the f&n!tion %oesn.t ret&rn 7alse, b&t a %ifferent
b&it/in va&e 5 None 5 that eva&ates as fase. 'f this %oesn.t ma-e sense, %on.t 0orry.
1CB Chapter H: 2eg&ar e1pressions
0e !an see the part of the D;4 se<&en!e that mat!he%, an% fig&re o&t 0hat the
mi%%e three bases 0ere:
(!)(+!)
,hat if 0e 0ant to e1tra!t more than one bit of the patternE "ay 0e 0ant to mat!h
this pattern:
(!'!+()*U3V!)'!+()*U2V!)
$hat.s #!, foo0e% by three bases, foo0e% by !C, foo0e% by t0o bases, foo0e%
by !C again. ,e !an s&rro&n% the bits of the pattern that 0e 0ant to e1tra!t 0ith
parentheses 5 this is !ae% capturing it:
(!('!+()*U3V&!)('!+()*U2V&!)
,e !an no0 refer to the !apt&re% bits of the pattern by s&ppying an arg&ment to
the group metho%. group+3, 0i ret&rn the bit of the string mat!he% by the
se!tion of the pattern in the first set of parentheses, group+6, 0i ret&rn the bit
mat!he% by the se!on%, et!.:
dna B ,!+(!)(+!)(+!)(!)+(,
4 store the match obDect in the 0ariable m
m B re.search(r,(!('!+()*U3V&!)('!+()*U2V&!), dna&
print(,entire match: , C m.group(&&
print(,first bit: , C m.group("&&
print(,second bit: , C m.group(2&&
$he o&tp&t sho0s that the three bases in the first variabe se!tion 0ere C#", an%
the t0o bases in the se!on% variabe se!tion 0ere #":
1CC Chapter H: 2eg&ar e1pressions
entire match: (!)(+!)(+!)
first bit: )(+
second bit: (+
$etting the position of a match
4s 0e as !ontaining information abo&t the contents of a mat!h, the mat!h ob9e!t
aso ho%s information abo&t the position of the mat!h. $he start an% end
metho%s get the positions of the start an% en% of the pattern on the se<&en!e:
dna B ,!+(!)(+!)(+!)(!)+(,
m B re.search(r,(!('!+()*U3V&!)('!+()*U2V&!), dna&
print(,start: , C str(m.start(&&&
print(,end: , C str(m.end(&&&
2emember that 0e start !o&nting from Gero, so in this !ase, the mat!h starting at
the thir% base has a start position of t0o:
start: 2
end: "3
,e !an get the start an% en% positions of in%ivi%&a gro&ps by s&ppying a n&mber
as the arg&ment to start an% end:
dna B ,!+(!)(+!)(+!)(!)+(,
m B re.search(r,(!('!+()*U3V&!)('!+()*U2V&!), dna&
print(,start: , C str(m.start(&&&
print(,end: , C str(m.end(&&&
print(,group one start: , C str(m.start("&&&
print(,group one end: , C str(m.end("&&&
print(,group t2o start: , C str(m.start(2&&&
print(,group t2o end: , C str(m.end(2&&&
1CF Chapter H: 2eg&ar e1pressions
'n this parti!&ar !ase, 0e !o&% fig&re o&t the start an% en% positions of the
in%ivi%&a gro&ps from the start an% en% positions of the 0hoe pattern:
start: 2
end: "3
group one start: 4
group one end: 7
group t2o start: J
group t2o end: ""
b&t that might not a0ays be possibe for patterns that have variabe ength
repeats.
&plitting a string using a regular e!pression
?!!asionay it !an be &sef& to spit a string &sing a reg&ar e1pression pattern as
the %eimiter. $he norma string split metho% %oesn.t ao0 this, b&t the re
mo%&e has a split f&n!tion of its o0n that ta-es a reg&ar e1pression pattern as
an arg&ment. $he first arg&ment is the pattern, the se!on% arg&ment is the string
to be spit.
'magine 0e have a !onsens&s D;4 se<&en!e that !ontains ambig&ity !o%es, an% 0e
0ant to e1tra!t a r&ns of !ontig&o&s &nambig&o&s bases. ,e nee% to spit the
D;4 string 0herever 0e see a base that isn.t 4, $, R or C:
dna B ,!)+<()!+Q()+!)(+R!)(!+S)(!=+)(,
runs B re.split(r,'8!+()*, dna&
print(runs&
2e!a that p&tting a !aret U at the start of a !hara!ter gro&p negates it. $he o&tp&t
sho0s ho0 the f&n!tion 0or-s 5 the ret&rn va&e is a ist of strings:
'3!)+3 3()!+3 3()+!)(+3 3!)(!+3 3)(!3 3+)(3*
1CH Chapter H: 2eg&ar e1pressions
/in"ing multiple matches
$he e1ampes 0e.ve seen so far %ea 0ith !ases 0here 0e.re ony intereste% in a
singe o!!&rren!e of a pattern in a string. 'f instea% 0e 0ant to fin% every pa!e
0here a pattern o!!&rs in a string, there are t0o f&n!tions in the re mo%&e to hep
&s.
re.findall ret&rns a ist of a mat!hes of a pattern in a string. $he first
arg&ment is the pattern, an% the se!on% arg&ment is the string. "ay 0e 0ant to fin%
a r&ns of ! an% " in a D;4 se<&en!e onger than five bases:
dna B ,!)+()!++!+!+)(+!)(!!!++!+!)()()(,
runs B re.findall(r,'!+*U4"##V, dna&
print(runs&
;oti!e that the ret&rn va&e of the findall metho% is not a mat!h ob9e!t 5 it is a
straightfor0ar% ist of strings:
'3!++!+!+3 3!!!++!+!3*
so 0e have no 0ay to e1tra!t the positions. 'f 0e 0ant to %o anything more
!ompi!ate% than simpy e1tra!ting the te1t of the mat!hes, 0e nee% to &se the
re.finditer metho%. finditer ret&rns a se<&en!e of mat!h ob9e!ts, so to %o
anything &sef& 0ith it, 0e nee% to &se the ret&rn va&e in a oop:
dna B ,!)+()!++!+!+)(+!)(!!!++!+!)()()(,
runs B re.finditer(r,'!+*U3"##V, dna&
for match in runs:
run/start B match.start(&
run/end B match.end(&
print(,!+ rich region from , C str(run/start& C , to , C str(run/end&&
1CI Chapter H: 2eg&ar e1pressions
4s 0e !an see from the o&tp&t:
!+ rich region from $ to "2
!+ rich region from "G to 2%
finditer gives &s !onsi%eraby more fe1ibiity that findall.
(ecap
J&st as in the previo&s !hapter, 0e earne% abo&t t0o %istin!t !on!epts :!on%itions,
an% the statements that &se them= in this !hapter 0e earne% abo&t regular
e!pressions, an% the functions that &se them.
,e starte% 0ith a brief intro%&!tion to t0o !on!epts that, 0hie not part of the
reg&ar e1pression toos, are ne!essary in or%er to &se them 5 ibraries an% ra0
strings. ,e got a far/from/!ompete overvie0 of feat&res that !an be &se% in reg&ar
e1pression patterns, an% a <&i!- oo- at the range of %ifferent things 0e !an %o
0ith them. J&st as reg&ar e1pressions themseves !an range from simpe to
!ompe1, so !an their &ses. ,e !an &se reg&ar e1pressions for simpe tas-s i-e
%etermining 0hether or not a se<&en!e !ontains a parti!&ar motif, or for
!ompi!ate% ones i-e i%entifying messenger 2;4 se<&en!es by &sing !ompe1
patterns.
(efore 0e move on to the e1er!ises, it.s important to re!ogniGe that for any given
pattern, there are probaby m&tipe 0ays to %es!ribe it &sing a reg&ar e1pression.
;ear the start of the !hapter, 0e !ame &p 0ith the pattern ##+!J",CC to %es!ribe
the 4va'' restri!tion enGyme re!ognition site, b&t it !o&% aso be 0ritten as
• ##.!"2CC%
• +##!CCJ##"CC,
• +##!J##",CC
• #M6N.!"2CM6N
1C@ Chapter H: 2eg&ar e1pressions
4s 0ith other sit&ations 0here there are m&tipe %ifferent 0ays to 0rite the same
thing, it.s best to be g&i%e% by 0hat is !earest to rea%.
1F0 Chapter H: 2eg&ar e1pressions
!ercises
5ccession names
7ere.s a ist of ma%e/&p gene a!!ession names:
1-nC@B3I, yh%!-2, eih%3@%@, !h%syeIBH, he%e3BCC, 19h%C3e, BC%a, %e3H%p
,rite a program that 0i print ony the a!!ession names that satisfy the foo0ing
!riteria 5 treat ea!h !riterion separatey:
• !ontain the n&mber C
• !ontain the etter % or e
• !ontain the etters % an% e in that or%er
• !ontain the etters % an% e in that or%er 0ith a singe etter bet0een them
• !ontain both the etters % an% e in any or%er
• start 0ith 1 or y
• start 0ith 1 or y an% en% 0ith e
• !ontain three or more n&mbers in a ro0
• en% 0ith % foo0e% by either a, r or p
(ouble digest
'n the chapter?# fie insi%e the e1er!ises %o0noa%, there.s a fie !ae% "na=t!t
0hi!h !ontains a ma%e/&p D;4 se<&en!e. #re%i!t the fragment engths that 0e 0i
get if 0e %igest the se<&en!e 0ith t0o ma%e/&p restri!tion enGymes 5 4b!', 0hose
re!ognition site is 4;$W44$, an% 4b!'', 0hose re!ognition site is RC2,W$R
:asteris-s in%i!ate the position of the !&t site=.
1F1 Chapter H: 2eg&ar e1pressions
&olutions
5ccession names
?bvio&sy, the b&- of the 0or- here is going to be !oming &p 0ith the reg&ar
e1pression patterns to see!t ea!h s&bset of the a!!ession names. 7ere.s the easy
bit 5 storing the a!!ession names in a ist an% then pro!essing them in a oop :the
first ine 0raps ro&n% be!a&se it.s too ong to fit on the page=:
accs B ',x:n$J43G, ,yhdc:2, ,eihd3JdJ, ,chdsyeG47, ,hedle34$$,
,xDhd$3e, ,4$da, ,de37dp,*
for acc in accs:
4 print if it passes the test
;o0 0e !an ta!-e the %ifferent !riteria one by one. For ea!h e1ampe, the !o%e
:bor%ere% by soi% ines= is foo0e% imme%iatey by the o&tp&t :bor%ere% by %otte%
ines=.
$he first !riterion is straightfor0ar% 5 a!!essions that !ontain the n&mber C. ,e
%on.t even have to &se any fan!y reg&ar e1pression feat&res:
for acc in accs:
if re.search(r,$, acc&:
print(,.t, C acc&
x:n$J43G
hedle34$$
xDhd$3e
4$da
;o0 for a!!essions that !ontain the etters % or e. ,e !an &se either aternation or a
!hara!ter gro&p. 7ere.s a so&tion &sing aternation:
1F2 Chapter H: 2eg&ar e1pressions
for acc in accs:
if re.search(r,(dTe&, acc&:
print(,.t, C acc&
yhdc:2
eihd3JdJ
chdsyeG47
hedle34$$
xDhd$3e
4$da
de37dp
$he ne1t one 5 a!!essions that !ontain both the etters % an% e, in that or%er 5 is a
bit more tri!-y. ,e !an.t 9&st &se a simpe aternation or a !hara!ter gro&p, be!a&se
they mat!h an+ of their !onstit&ent parts, an% 0e nee% bot, % an% e. ?ne 0ay to
thin- of the pattern is %, foo0e% by some other etters an% n&mbers, foo0e% by e.
,e have to be !aref& 0ith o&r <&antifiers, ho0ever 5 at first gan!e the pattern
d.Fe oo-s goo%, b&t it 0i fai to mat!h the a!!ession 0here e foo0s % %ire!ty.
$o ao0 for the fa!t that % might be imme%iatey foo0e% by e, 0e nee% to &se the
asteris-:
for acc in accs:
if re.search(r,d.Ke, acc&:
print(,.t, C acc&
chdsyeG47
hedle34$$
xDhd$3e
de37dp
$he ne1t re<&irement 5 %, foo0e% by a singe etter, foo0e% by e 5 is a!t&ay
easier to 0rite a pattern for, even tho&gh it so&n%s more !ompi!ate%. ,e simpy
remove the asteris-, an% the perio% 0i no0 mat!h any singe !hara!ter:
1F3 Chapter H: 2eg&ar e1pressions
for acc in accs:
if re.search(r,(d.e&, acc&:
print(,.t, C acc&
hedle34$$
$he ne1t re<&irement 5 % an% e in any or%er 5 is more %iffi!&t. ,e !o&% %o it 0ith
an aternation &sing the pattern +d.4eJe.4d,, 0hi!h transates as " then e0 or e
then ". 'n this !ase, ' thin- it.s !earer to !arry o&t t0o separate reg&ar e1pression
sear!hes an% !ombine them into a !ompe1 !on%ition:
for acc in accs:
if re.search(r,d.Ke, acc& or re.search(r,e.Kd, acc&:
print(,.t, C acc&
hedle34$$
de37dp
$o fin% a!!essions that start 0ith either 1 or y, 0e nee% to !ombine an aternation
0ith a start/of/string an!hor:
for acc in accs:
if re.search(r,8(xTy&, acc&:
print(,.t, C acc&
x:n$J43G
yhdc:2
xDhd$3e
1FB Chapter H: 2eg&ar e1pressions
,e !an mo%ify this <&ite easiy to a%% the re<&irement that the a!!ession en%s
0ith e. 4s before, 0e nee% to &se .W in the mi%%e to mat!h any n&mber of any
!hara!ter, res&ting in <&ite a !ompe1 pattern:
for acc in accs:
if re.search(r,8(xTy&.Ke6, acc&:
print(,.t, C acc&
xDhd$3e
$o mat!h three or more n&mbers in a ro0, 0e nee% a more spe!ifi! <&antifier 5 the
!&ry bra!-ets 5 an% a !hara!ter gro&p 0hi!h !ontains a the n&mbers:
for acc in accs:
if re.search(r,'#"234$%7GJ*U3"##V, acc&:
print(,.t, C acc&
x:n$J43G
chdsyeG47
hedle34$$
,e !an a!t&ay ma-e this a bit more !on!ise. $he !hara!ter gro&p of a %igits is
s&!h a !ommon one that there.s a b&it/in shorthan% for it: \d. ,e !an aso ta-e
a%vantage of a shorthan% in the !&ry bra!-et <&antifier 5 if 0e eave off the &pper
bo&n%, then it mat!hes 0ith no &pper imit. $he more !on!ise version:
for acc in accs:
if re.search(r,.dU3V, acc&:
print(,.t, C acc&
1FC Chapter H: 2eg&ar e1pressions
x:n$J43G
chdsyeG47
hedle34$$
$he fina re<&irement is <&ite simpe an% ony re<&ires a !hara!ter gro&p an% an
en%/of/string an!hor to sove:
for acc in accs:
if re.search(r,d'arp*6, acc&:
print(,.t, C acc&
4$da
de37dp
(ouble digest
$his is a har% probem, an% there are severa 0ays to approa!h it. +et.s simpify it by
first fig&ring o&t 0hat the fragment engths 0o&% be if 0e %igeste% the se<&en!e
0ith 9&st a singe restri!tion enGyme
1
. ,e. open an% rea% the fie a in one go
:there.s no nee% to pro!ess it ine/by/ine as it.s 9&st a singe se<&en!e=, then 0e.
&se re.finditer to fig&re o&t the positions of a the !&t sites.
$he patterns themseves are reativey simpe: ; means any base, so the pattern for
the 4b!' site is !.!"#C2"!!". $he ambig&ity !o%e Q means ! or # an% the !o%e R
means ! or ", so the pattern for 4b!'' is #C.!#2.!"2"#. 7ere.s the !o%e to
!a!&ate the start positions of the mat!hes for 4b!':
1 For the p&rposes of this e1er!ise, 0e are of !o&rse ignoring a the interesting !hemi!a -ineti!s of
restri!tion enGymes an% ass&ming that a enGymes !&t 0ith !ompete spe!ifi!ity an% effi!ien!y.
1FF Chapter H: 2eg&ar e1pressions
import re
dna B open(,dna.txt,&.read(&.rstrip(,.n,&
print(,!bcL cuts at:,&
for match in re.finditer(r,!'!+()*+!!+, dna&:
print(match.start(&&
$he o&tp&t from this oo-s goo%:
!bcL cuts at:
""4#
"%2$
b&t it.s not <&ite right 5 it.s teing &s the positions of the start of ea!h mat!h, b&t
the enGyme a!t&ay !&ts 3 base pairs &pstream of the start. $o get the position of
the !&t site, 0e nee% to a%% three to the start of ea!h mat!h:
import re
dna B open(,dna.txt,&.read(&.rstrip(,.n,&
print(,!bcL cuts at:,&
for match in re.finditer(r,!'!+()*+!!+, dna&:
print(match.start(& C 3&
!bcL cuts at:
""43
"%2G
;o0 0e.ve got the !&t positions, ho0 are 0e going to 0or- o&t the fragment siGesE
?ne 0ay is to go thro&gh ea!h !&t site in or%er an% meas&re the %istan!e bet0een
it an% the previo&s one 5 that 0i give &s the ength of a singe fragment. $o ma-e
this 0or- 0e. have to a%% >imaginary> !&t sites at the very start an% en% of the
se<&en!e:
1FH Chapter H: 2eg&ar e1pressions
import re
dna B open(,dna.txt,&.read(&.rstrip(,.n,&
all/cuts B '#*
for match in re.finditer(r,!'!+()*+!!+, dna&:
all/cuts.append(match.start(& C 3&
all/cuts.append(len(dna&&
print(all/cuts&
+et.s ta-e a moment to e1amine 0hat.s going on in this program. ,e start by
!reating a ne0 ist variabe !ae% all_cuts to ho% the !&t positions :ine 3=. 4t
this point, the all_cuts variabe ony has one eement: Gero, the position of the
start of the se<&en!e. ;e1t, for ea!h mat!h to the pattern :ine B=, 0e ta-e the start
position, a%% three to it to get the !&t position, an% appen% that n&mber to the
all_cuts ist :ine C=. Finay, 0e appen% the position of the ast !hara!ter in the
D;4 string to the all_cuts ist :ine F=. ,hen 0e print the all_cuts ist, 0e !an
see that it !ontains the position of the start an% en% of the string, an% the interna
positions of the !&t sites:
'# ""43 "%2G 2#"2*
;o0 0e !an 0rite a se!on% oop to go thro&gh the aY!&ts ist an%, for ea!h !&t
position, 0or- o&t the siGe of the fragment that 0i be !reate% by fig&ring o&t the
%istan!e to the previo&s !&t site :i.e. the previo&s eement in the ist=. $o ma-e this
0or-, ho0ever, 0e !an.t 9&st &se a norma oop 5 0e have to start at the se!on%
eement of the ist :be!a&se the first eement has no previo&s eement= an% 0e have
to 0or- 0ith the in%e1 of ea!h eement, rather than the eement itsef. ,e. &se the
range f&n!tion to generate the ist of in%e1es that 0e 0ant to pro!ess 5 0e nee% to
go from in%e1 1 :i.e. the se!on% eement of the ist= to the ast in%e1 :0hi!h is the
ength of the ist=:
1
2
3
4
5
6
7
1FI Chapter H: 2eg&ar e1pressions
for i in range("len(all/cuts&&:
this/cut/position B all/cuts'i*
pre0ious/cut/position B all/cuts'iF"*
fragment/siEe B this/cut/position F pre0ious/cut/position
print(,one fragment siEe is , C str(fragment/siEe&&
$he oop variabe i is &se% to store ea!h va&e that is generate% by the range
f&n!tion :ine 1=. For ea!h va&e of i 0e get the !&t position at that in%e1 :ine 2=
an% the !&t position at the previo&s in%e1 :ine 3= an% then fig&re o&t the %istan!e
bet0een them :ine B=. $he o&tp&t sho0s ho0, for t0o !&ts, 0e get three
fragments:
one fragment siEe is ""43
one fragment siEe is 4G$
one fragment siEe is 3G4
;o0 for the fina part of the so&tion: ho0 %o 0e %o the same thing for t0o
%ifferent enGymesE ,e !an a%% in the se!on% enGyme pattern 0ith the appropriate
!&t site offset an% appen% the !&t positions to the all_cuts variabe:
import re
dna B open(,dna.txt,&.read(&.rstrip(,.n,&
all/cuts B '#*

4 add cut positions for !bcL
for match in re.finditer(r,!'!+()*+!!+, dna&:
all/cuts.append(match.start(& C 3&

4 add cut positions for !bcLL
for match in re.finditer(r,()'!(*'!+*+(, dna&:
all/cuts.append(match.start(& C 4&

4 add the final position
all/cuts.append(len(dna&&
print(all/cuts&
1
2
3
4
5
1F@ Chapter H: 2eg&ar e1pressions
b&t oo- 0hat happens 0hen 0e print the eements of aY!&ts:
'# ""43 "%2G 4GG "$77 2#"2*
,e get Gero, then the t0o !&t positions for the first enGyme in as!en%ing or%er,
then the t0o !&t positions for the se!on% enGyme in as!en%ing or%er, then the
position of the en% of the se<&en!e. $he metho% for t&rning a ist of !&t positions
into fragment siGes that 0e %eveope% above isn.t going to 0or- 0ith this ist,
be!a&se it reies on the ist of positions being in as!en%ing or%er. 'f 0e try it 0ith
the ist of !&t positions pro%&!e% by the above !o%e, 0e. en% &p 0ith obvio&sy
in!orre!t fragment siGes:
one fragment siEe is ""43
one fragment siEe is 4G$
one fragment siEe is F""4#
one fragment siEe is "#GJ
one fragment siEe is 434
7appiy, #ython.s b&it/in sort f&n!tion !an !ome to the res!&e. 4 0e nee% to %o
is sort the ist of !&t positions before pro!essing it, an% 0e get the right ans0ers.
7ere.s the !ompete, fina !o%e:
1H0 Chapter H: 2eg&ar e1pressions
import re
dna B open(,dna.txt,&.read(&.rstrip(,.n,&
print(str(len(dna&&&
all/cuts B '#*

4 add cut positions for !bcL
for match in re.finditer(r,!'!+()*+!!+, dna&:
all/cuts.append(match.start(& C 3&

4 add cut positions for !bcLL
for match in re.finditer(r,()'!(*'!+*+(, dna&:
all/cuts.append(match.start(& C 4&

4 add the final position
all/cuts.append(len(dna&&
sorted/cuts B sorted(all/cuts&
print(sorted/cuts&


for i in range("len(sorted/cuts&&:
this/cut/position B sorted/cuts'i*
pre0ious/cut/position B sorted/cuts'iF"*
fragment/siEe B this/cut/position F pre0ious/cut/position
print(,one fragment siEe is , C str(fragment/siEe&&
1H1 Chapter I: Di!tionaries
': (ictionaries
&toring paire" "ata
"&ppose 0e 0ant to !o&nt the n&mber of 4s in a D;4 se<&en!e. Carrying o&t the
!a!&ation is <&ite straightfor0ar% 5 in fa!t it.s one of the first things 0e %i% in
!hapter 2:
dna B ,!+)(!+)(!+)(+!)()+(!,
a/count B dna.count(,!,&
7o0 0i o&r !o%e !hange if 0e 0ant to generate a !ompete ist of base !o&nts for
the se<&en!eE ,e. a%% a ne0 variabe for ea!h base:
dna B ,!+)(!+)(!+)(+!)()+(!,
a/count B dna.count(,!,&
t/count B dna.count(,+,&
g/count B dna.count(,(,&
c/count B dna.count(,),&
an% no0 o&r !o%e is starting to oo- rather repetitive. 't.s not too ba% for the fo&r
in%ivi%&a bases, b&t 0hat if 0e 0ant to generate !o&nts for the 1F %in&!eoti%es:
dna B ,!+)(!+)(!+)(+!)()+(!,
aa/count B dna.count(,!!,&
at/count B dna.count(,!+,&
ag/count B dna.count(,!(,&
...etc. etc.
or the FB trin&!eoti%es:
1H2 Chapter I: Di!tionaries
dna B ,!+)(!+)(!+)(+!)()+(!,
aaa/count B dna.count(,!!!,&
aat/count B dna.count(,!!+,&
aag/count B dna.count(,!!(,&
...etc. etc.
For trin&!eoti%es an% onger, the sit&ation is parti!&ary ba%. $he D;4 se<&en!e
is 20 bases ong, so it ony !ontains 1I overapping trin&!eoti%es in tota. $his
means that 0e. en% &p 0ith FB %ifferent variabes, at east BF of 0hi!h 0i ho%
the va&e Gero.
?ne possibe 0ay ro&n% this is to store the va&es in a ist. 'f 0e &se three neste%
oops, 0e !an generate a possibe trin&!eoti%es, !a!&ate the !o&nt for ea!h one,
an% store a the !o&nts in a ist:
dna B ,!!+(!+)(!+)(+!)()+(!,
all/counts B '*
for base" in '3!3 3+3 3(3 3)3*:
for base2 in '3!3 3+3 3(3 3)3*:
for base3 in '3!3 3+3 3(3 3)3*:
trinucleotide B base" C base2 C base3
count B dna.count(trinucleotide&
print(,count is , C str(count& C , for , C trinucleotide&
all/counts.append(count&
print(all/counts&
1H3 Chapter I: Di!tionaries
4tho&gh the !o%e is above is <&ite !ompa!t, an% %oesn.t re<&ire h&ge n&mbers of
variabes, the o&tp&t sho0s t0o probems 0ith this approa!h:
count is # for !!!
count is " for !!+
count is # for !!(
count is # for !!)
count is # for !+!
count is # for !++
count is " for !+(
count is 2 for !+)
W many lines remo0ed W
'# " # # # # " 2 # # # # # # " # # # # " # # # #
2 # # # # # 2 # # 2 # # " # # # # # # # # " # # #
# # # # # " # " " # " # # # #*
Firsty, the %ata are sti in!re%iby sparse 5 the vast ma9ority of the !o&nts are Gero.
"e!on%y, the !o&nts themseves are no0 %is!onne!te% from the trin&!eoti%es. 'f
0e 0ant to oo- &p the !o&nt for a singe trin&!eoti%e 5 for e1ampe, $R4 5 0e
first have to fig&re o&t that $R4 0as the 2C
th
trin&!eoti%e generate% by o&r oops.
?ny then !an 0e get the eement at the !orre!t in%e1:
print(,count for +(! is , C str(all/counts'24*&&
1HB Chapter I: Di!tionaries
,e !an try vario&s tri!-s to get ro&n% this probem. ,hat if 0e generate% t0o ists
5 one of !o&nts, an% one of the trin&!eoti%es themsevesE
dna B ,!!+(!+)(!+)(+!)()+(!,
all/trinucleotides B '*
all/counts B '*
for base" in '3!3 3+3 3(3 3)3*:
for base2 in '3!3 3+3 3(3 3)3*:
for base3 in '3!3 3+3 3(3 3)3*:
trinucleotide B base" C base2 C base3
count B dna.count(trinucleotide&
all/trinucleotides.append(trinucleotide&
all/counts.append(count&
print(all/counts&
print(all/trinucleotides&
;o0 0e have t0o ists of the same ength, 0ith a one/to/one !orrespon%en!e
bet0een the eements:
'# " # # # # " 2 # # # # # # " # # # # " # # # #
2 # # # # # 2 # # 2 # # " # # # # # # # # " # # #
# # # # # " # " " # " # # # #*
'3!!!3 3!!+3 3!!(3 3!!)3 3!+!3 3!++3 3!+(3 3!+)3 3!(!3 3!(+3
3!((3 3!()3 3!)!3 3!)+3 3!)(3 3!))3 3+!!3 3+!+3 3+!(3 3+!)3
3++!3 3+++3 3++(3 3++)3 3+(!3 3+(+3 3+((3 3+()3 3+)!3 3+)+3
3+)(3 3+))3 3(!!3 3(!+3 3(!(3 3(!)3 3(+!3 3(++3 3(+(3 3(+)3
3((!3 3((+3 3(((3 3(()3 3()!3 3()+3 3()(3 3())3 3)!!3 3)!+3
3)!(3 3)!)3 3)+!3 3)++3 3)+(3 3)+)3 3)(!3 3)(+3 3)((3 3)()3
3))!3 3))+3 3))(3 3)))3*
$his ao0s &s to oo- &p the !o&nt for a given trin&!eoti%e in a sighty more
appeaing 0ay 5 0e !an oo- &p the in%e1 of the trin&!eoti%e in the
all_trinucleotides ist, then get the !o&nt at the same in%e1 in the
all_counts ist:
1HC Chapter I: Di!tionaries
i B all/trinucleotides.index(3+(!3&
c B all/counts'i*
print(3count for +(! is 3 C str(c&&
$his is a itte bit ni!er, b&t sti has ma9or %ra0ba!-s. ,e.re sti storing a those
Geros, an% no0 0e have t0o ists to -eep tra!- of. ,e nee% to be in!re%iby !aref&
0hen manip&ating either of the t0o ists to ma-e s&re that they stay perfe!ty
syn!hroniGe% 5 if 0e ma-e any !hange to one ist b&t not the other, then there 0i
no onger be a one/to/one !orrespon%en!e bet0een eements an% 0e. get the
0rong ans0er 0hen 0e try to oo- &p a !o&nt.
$his approa!h is aso so0
1
. $o fin% the in%e1 of a given trin&!eoti%e in the
all_trinucleotides ist, #ython has to oo- at ea!h eement one at a time &nti
it fin%s the one 0e.re oo-ing for. $his means that as the siGe of the ist gro0s
2
, the
time ta-en to oo- &p the !o&nt for a given eement 0i gro0 aongsi%e it.
'f 0e ta-e a step ba!- an% thin- abo&t the probem in more genera terms, 0hat 0e
nee% is a 0ay of storing pairs of %ata :in this !ase, trin&!eoti%es an% their !o&nts=
in a 0ay that ao0s &s to effi!ienty oo- &p the !o&nt for any given trin&!eoti%e.
$his probem of storing paire% %ata is in!re%iby !ommon in programming. ,e
might 0ant to store:
• protein se<&en!e names an% their se<&en!es
• D;4 restri!tion enGyme names an% their motifs
• !o%ons an% their asso!iate% amino a!i% resi%&es
• !oeag&es. names an% their emai a%%resses
• sampe names an% their !o/or%inates
• 0or%s an% their %efinitions
1 4s a r&e, 0e.ve avoi%e% ta-ing abo&t performan!e in this boo-, b&t 0e. brea- the r&e in this !ase.
2 For instan!e, imagine !arrying o&t the same e1er!ise 0ith the appro1imatey one miion &ni<&e 10/mers.
1HF Chapter I: Di!tionaries
4 these are e1ampes of 0hat 0e !a keyBvalue pairs. 'n ea!h !ase 0e have pairs of
keys an% values:
=e+ >alue
trin&!eoti%e !o&nt
name protein se<&en!e
name restri!tion enGyme motif
!o%on amino a!i% resi%&e
name emai a%%ress
sampe !oor%inates
0or% %efinition
$he ast e1ampe in this tabe 5 0or%s an% their %efinitions 5 is an interesting one
be!a&se 0e have a too in the physi!a 0or% for storing this type of %ata 5 a
%i!tionary. #ython.s too for soving this type of probem is aso !ae% a %i!tionary
:&s&ay abbreviate% to "ict= an% in this !hapter 0e. see ho0 to !reate an% &se
them.
4reating a "ictionary
$he synta1 for !reating a %i!tionary is simiar to that for !reating a ist, b&t 0e &se
!&ry bra!-ets rather than s<&are ones. )a!h pair of %ata, !onsisting of a -ey an% a
va&e, is !ae% an item. ,hen storing items in a %i!tionary, 0e separate them 0ith
!ommas. ,ithin an in%ivi%&a item, 0e separate the -ey an% the va&e 0ith a !oon.
7ere.s a bit of !o%e that !reates a %i!tionary of restri!tion enGymes :&sing %ata
from the previo&s !hapter= 0ith three items:
enEymes B U 39coQL3:r3(!!++)3 3!0aLL3:r3(((!T+&))3 3NisL3:3()'!+()*()3 V
1HH Chapter I: Di!tionaries
'n this !ase, the -eys an% va&es are both strings
1
. "pitting the %i!tionary
%efinition over severa ines ma-es it easier to rea%:
enEymes B U
39coQL3 : r3(!!++)3
3!0aLL3 : r3(((!T+&))3
3NisL3 : r3()'!+()*()3
V
b&t %oesn.t affe!t the !o%e at a. $o retrieve a bit of %ata from the %i!tionary 5 i.e.
to oo- &p the motif for a parti!&ar enGyme 5 0e 0rite the name of the %i!tionary,
foo0e% by the -ey in s<&are bra!-ets:
print(enEymes'3NisL3*&
$he !o%e oo-s very simiar to &sing a ist, b&t instea% of giving the in%e1 of the
eement 0e 0ant, 0e.re giving the key for the value that 0e 0ant to retrieve.
Di!tionaries are a very &sef& 0ay to store %ata, b&t they !ome 0ith some
restri!tions. $he ony types of %ata 0e are ao0e% to &se as -eys are strings an%
n&mbers
2
, so 0e !an.t, for e1ampe, !reate a %i!tionary 0here the -eys are fie
ob9e!ts. Sa&es !an be 0hatever type of %ata 0e i-e. 4so, -eys must be &ni<&e 5
0e !an.t store m&tipe va&es for the same -ey. 3o& might thin- that this ma-es
%i!ts ess &sef&, b&t there are 0ays ro&n% the probem of storing m&tipe va&es 5
0e 0on.t nee% them for the e1ampes in this !hapter, b&t the !hapter on !ompe1
%ata str&!t&res in A"vance" Python for 8iologists gives %etais.
1 $he va&es are a!t&ay ra0 strings, b&t that.s not important.
2 ;ot stri!ty tr&eD 0e !an &se any imm&tabe type, b&t that is beyon% the s!ope of this boo-.
1HI Chapter I: Di!tionaries
'n rea/ife programs, it.s reativey rare that 0e. 0ant to !reate a %i!tionary a in
one go i-e in the e1ampe above. More often, 0e. 0ant to !reate an empty
%i!tionary, then a%% -ey/va&e pairs to it :9&st as 0e often !reate an empty ist an%
then a%% eements to it=.
$o !reate an empty %i!tionary 0e simpy 0rite a pair of !&ry bra!-ets on their o0n,
an% to a%% eements, 0e &se the s<&are/bra!-ets notation on the eft/han% si%e of
an assignment. 7ere.s a bit of !o%e that stores the restri!tion enGyme %ata one item
at a time:
enEymes B UV
enEymes'39coQL3* B r3(!!++)3
enEymes'3!0aLL* B r3(((!T+&))3
enEymes'3NisL3* B r3()'!+()*()3
,e !an %eete a -ey from a %i!tionary &sing the pop metho%. pop a!t&ay ret&rns
the va&e an% %eetes the -ey at the same time:
enEymes B U
39coQL3 : r3(!!++)3
3!0aLL3 : r3(((!T+&))3
3NisL3 : r3()'!+()*()3
V
4 remo0e the 9coQL enEyme from the dict
enEymes.pop(39coQL3&
+et.s ta-e another oo- at the trin&!eoti%e !o&nt e1ampe from the start of the
!hapter. 7ere.s ho0 0e store the trin&!eoti%es an% their !o&nts in a %i!tionary:
1H@ Chapter I: Di!tionaries
dna B ,!!+(!+)(!+)(+!)()+(!,
counts B UV
for base" in '3!3 3+3 3(3 3)3*:
for base2 in '3!3 3+3 3(3 3)3*:
for base3 in '3!3 3+3 3(3 3)3*:
trinucleotide B base" C base2 C base3
count B dna.count(trinucleotide&
counts'trinucleotide* B count
print(counts&
,e !an see from the o&tp&t that the trin&!eoti%es an% their !o&nts are store%
together in one variabe:
U3!))3: # 3!+(3: " 3!!(3: # 3!!!3: # 3!+)3: 2 3!!)3: # 3!+!3: #
3!((3: # 3))+3: # 3)+)3: # 3!()3: # 3!)!3: # 3!(!3: # 3)!+3: #
3!!+3: " 3!++3: # 3)+(3: " 3)+!3: # 3!)+3: # 3)!)3: # 3!)(3: "
3)!!3: # 3!(+3: # 3)!(3: # 3))(3: # 3)))3: # 3)++3: # 3+!+3: #
3((+3: # 3+(+3: # 3)(!3: " 3))!3: # 3+)+3: # 3(!+3: 2 3)((3: #
3+++3: # 3+()3: # 3(((3: # 3+!(3: # 3((!3: # 3+!!3: # 3(()3: #
3+!)3: " 3++)3: # 3+)(3: 2 3++!3: # 3++(3: # 3+))3: # 3(!!3: #
3+((3: # 3()!3: # 3(+!3: " 3())3: # 3(+)3: # 3()(3: # 3(+(3: #
3(!(3: # 3(++3: # 3()+3: " 3+(!3: 2 3(!)3: # 3)(+3: " 3+)!3: #
3)()3: "V
,e sti have a ot of repetitive !o&nts of Gero, b&t oo-ing &p the !o&nt for a
parti!&ar trin&!eoti%e is no0 very straightfor0ar%:
print(counts'3+(!3*&
,e no onger have to 0orry abo&t either >memoriGing> the or%er of the !o&nts or
maintaining t0o separate ists.
+et.s no0 see if 0e !an fin% a 0ay of avoi%ing storing a those Gero !o&nts. ,e !an
a%% an if statement that ens&res that 0e ony store a !o&nt if it.s greater than
Gero:
1I0 Chapter I: Di!tionaries
dna B ,!!+(!+)(!+)(+!)()+(!,
counts B UV
for base" in '3!3 3+3 3(3 3)3*:
for base2 in '3!3 3+3 3(3 3)3*:
for base3 in '3!3 3+3 3(3 3)3*:
trinucleotide B base" C base2 C base3
count B dna.count(trinucleotide&
if count - #:
counts'trinucleotide* B count
print(counts&
,hen 0e oo- at the o&tp&t from the above !o%e, 0e !an see that the amo&nt of
%ata 0e.re storing is m&!h smaer 5 9&st the !o&nts for the trin&!eoti%es that are
greater than Gero:
U3!+(3: " 3!)(3: " 3!+)3: 2 3(+!3: " 3)+(3: " 3)()3: " 3(!+3: 2
3)(!3: " 3!!+3: " 3+(!3: 2 3()+3: " 3+!)3: " 3+)(3: 2 3)(+3: "V
;o0 0e have a ne0 probem to %ea 0ith. +oo-ing &p the !o&nt for a given
trin&!eoti%e 0or-s fine 0hen the !o&nt is positive:
print(counts'3+(!3*&
(&t 0hen the !o&nt is Gero, the trin&!eoti%e %oesn.t appear as a -ey in the
%i!tionary:
print(counts'3!!!3*&
so 0e 0i get a Key)rror 0hen 0e try to oo- it &p:
Iey9rror: 3!!!3
1I1 Chapter I: Di!tionaries
$here are t0o possibe 0ays to fi1 this. ,e !an !he!- for the e1isten!e of a -ey in a
%i!tionary :9&st i-e 0e !an !he!- for the e1isten!e of an eement in a ist=, an% ony
try to retrieve it on!e 0e -no0 it e1ists:
if 3!!!3 in counts:
print(counts(3!!!3&&
4ternativey, 0e !an &se the %i!tionary.s get metho%. get &s&ay 0or-s 9&st i-e
&sing s<&are bra!-ets: the foo0ing t0o ines %o e1a!ty the same thing:
print(counts'3+(!3*&
print(counts.get(3+(!3&&
$he thing that ma-es get reay &sef&, ho0ever, is that it !an ta-e an optiona
se!on% arg&ment, 0hi!h is the %efa&t va&e to be ret&rne% if the -ey isn.t present
in the %i!tionary. 'n this !ase, 0e -no0 that if a given trin&!eoti%e %oesn.t appear
in the %i!tionary then its !o&nt is Gero, so 0e !an give Gero as the %efa&t va&e an%
&se get to print o&t the !o&nt for any trin&!eoti%e:
print(,count for +(! is , C str(counts.get(3+(!3 #&&&
print(,count for !!! is , C str(counts.get(3!!!3 #&&&
print(,count for (+! is , C str(counts.get(3(+!3 #&&&
print(,count for +++ is , C str(counts.get(3+++3 #&&&
4s 0e !an see from the o&tp&t, 0e no0 %on.t have to 0orry abo&t 0hether or not
ea!h trin&!eoti%e appears in the %i!tionary 5 get ta-es !are of everything an%
ret&rns Gero 0hen appropriate:
count for +(! is 2
count for !!! is #
count for (+! is "
count for +++ is #
1I2 Chapter I: Di!tionaries
5terating over a "ictionary
,hat if, instea% of oo-ing &p a singe item from a %i!tionary, 0e 0ant to %o
something for a itemsE For e1ampe, imagine that 0e 0ante% to ta-e o&r !o&nts
%i!tionary variabe from the !o%e above an% print o&t a trin&!eoti%es 0here the
!o&nt 0as 2. ?ne 0ay to %o it 0o&% be to &se o&r three neste% oops again to
generate a possibe trin&!eoti%es, then oo- &p the !o&nt for ea!h one an% %e!i%e
0hether or not to print it:
for base" in '3!3 3+3 3(3 3)3*:
for base2 in '3!3 3+3 3(3 3)3*:
for base3 in '3!3 3+3 3(3 3)3*:
trinucleotide B base" C base2 C base3
if counts.get(trinucleotide #& BB 2:
print(trinucleotide&
4s 0e !an see from the o&tp&t, this 0or-s perfe!ty 0e:
!+)
+(!
+)(
(!+
(&t it seems ineffi!ient to go thro&gh the 0hoe pro!ess of generating a possibe
trin&!eoti%es again, 0hen the information 0e 0ant 5 the ist of trin&!eoti%es 5 is
area%y in the %i!tionary. 4 better approa!h 0o&% be to rea% the ist of -eys
%ire!ty from the %i!tionary, 0hi!h is 0hat the keys metho% %oes.
1I3 Chapter I: Di!tionaries
Iterating over 1e+s
,hen &se% on a %i!tionary, the keys metho% ret&rns a ist of a the -eys in the
%i!tionary:
print(counts.:eys(&&
+oo-ing at the o&tp&t
1
!onfirms that this is the ist of trin&!eoti%es 0e 0ant to
!onsi%er :remember that 0e.re oo-ing for trin&!eoti%es 0ith a !o&nt of t0o, so 0e
%on.t nee% to !onsi%er ones that aren.t in the %i!tionary as 0e area%y -no0 that
they have a !o&nt of Gero=:
'3!+(3 3!)(3 3!+)3 3(+!3 3)+(3 3)()3 3(!+3 3)(!3 3!!+3 3+(!3
3()+3 3+!)3 3+)(3 3)(+3*
*sing keys, o&r !o%e for printing o&t a the trin&!eoti%es that appear t0i!e in the
D;4 se<&en!e be!omes a ot more !on!ise:
for trinucleotide in counts.:eys(&:
if counts.get(trinucleotide& BB 2:
print(trinucleotide&
$his version prints e1a!ty the same set of trin&!eoti%es as the more verbose
metho%:
!+)
(!+
+(!
+)(
1 'f yo&.re &sing #ython 3 yo& might see sighty %ifferent o&tp&t here, b&t a the !o%e e1ampes 0i 0or-
9&st the same
1IB Chapter I: Di!tionaries
(efore 0e move on, ta-e a moment to !ompare the o&tp&t imme%iatey above this
paragraph 0ith the o&tp&t from the three/oop version from earier in this se!tion.
3o&. noti!e that 0hie the set of trinucleotides is the same, the order in w,ic,
t,e+ appear is %ifferent. $his i&strates an important point abo&t %i!tionaries 5
they are inherently unor"ere". $hat means that 0hen 0e &se the keys metho% to
iterate over a %i!tionary, 0e !an.t rey on pro!essing the items in the same or%er
that 0e a%%e% them. $his is in !ontrast to ists, 0hi!h a0ays maintain the same
or%er 0hen ooping. 'f 0e 0ant to !ontro the or%er in 0hi!h -eys are printe% 0e
!an &se the sorted metho% to sort the ist before pro!essing it:
for trinucleotide in sorted(counts.:eys(&&:
if counts.get(trinucleotide& BB 2:
print(trinucleotide&
Iterating over items
'n the e1ampe !o%e above, the first thing 0e nee% to %o insi%e the oop is to oo-
&p the va&e for the !&rrent -ey. $his is a very !ommon pattern 0hen iterating over
%i!tionaries 5 so !ommon, in fa!t, that #ython has a spe!ia shorthan% for it.
'nstea% of %oing this:
for :ey in my/dict.:eys(&:
0alue B my/dict.get(:ey&
4 do something 2ith :ey and 0alue
,e !an &se the items metho% to iterate over pairs of %ata, rather than 9&st -eys:
for :ey 0alue in my/dict.items(&:
4 do something 2ith :ey and 0alue
$he items metho% %oes something sighty %ifferent from a the other metho%s
0e.ve seen so far in this boo-D rather than ret&rning a single value, or a list of
1IC Chapter I: Di!tionaries
values, it ret&rns a list of pairs of values. $hat.s 0hy 0e have to give t0o variabe
names at the start of the oop. 7ere.s ho0 0e !an &se the items metho% to pro!ess
o&r %i!tionary of trin&!eoti%e !o&nts 9&st i-e before:
for trinucleotide count in counts.items(&:
if count BB 2:
print(trinucleotide&
$his metho% is generay preferre% for iterating over items in a %i!tionary, as it
ma-es the intention of the !o%e very !ear.
(ecap
,e starte% this !hapter by e1amining the probem of storing paire% %ata in #ython.
4fter oo-ing at a !o&pe of &nsatisfa!tory 0ays to %o it &sing toos that 0e.ve
area%y earne% abo&t, 0e intro%&!e% a ne0 type of %ata str&!t&re 5 the %i!tionary
5 0hi!h offers a m&!h ni!er so&tion to the probem of storing paire% %ata.
+ater in the !hapter, 0e sa0 that the rea benefit of &sing %i!tionaries is the
effi!ient oo-&p they provi%e. ,e sa0 ho0 to !reate %i!tionaries an% manip&ate
the items in them, an% severa %ifferent 0ays to oo- &p va&es for -no0n -eys. ,e
aso sa0 ho0 to iterate over a the items in %i!tionary.
'n the pro!ess, 0e &n!overe% a fe0 restri!tions on 0hat %i!tionaries are !apabe of
5 0e.re ony ao0e% to &se a !o&pe of %ifferent %ata types for -eys, they m&st be
&ni<&e, an% 0e !an.t rey on their or%er. J&st as a physi!a %i!tionary ao0s &s to
rapi%y oo- &p the %efinition for a 0or% b&t not the other 0ay ro&n%, #ython
%i!tionaries ao0 &s to rapi%y oo- &p the va&e asso!iate% 0ith a -ey, b&t not the
reverse.
(e!a&se of their abiity to oo- &p a given va&e very rapi%y given a -ey, %i!ts are
e1tremey &sef& 0hen storing !ompe1 %ata. $a-e a oo- at the ast fe0 se!tions of
1IF Chapter I: Di!tionaries
the !hapter on !ompe1 %ata str&!t&res in A"vance" Python for 8iologists for a set of
e1ampes of this te!hni<&e.
1IH Chapter I: Di!tionaries
!ercises
(/5 translation
,rite a program that 0i transate a D;4 se<&en!e into protein. 3o&r program
sho&% &se the stan%ar% geneti! !o%e 0hi!h !an be fo&n% at this *2+
1
.
1 http://000.n!bi.nm.nih.gov/$a1onomy/ta1onomyhome.htm/in%e1.!giE!hapterZtgen!o%es["R1
1II Chapter I: Di!tionaries
&olutions
(/5 translation
$he %es!ription of this e1er!ise is very short, b&t it hi%es <&ite a bit of !ompe1ity8
$o transate a D;4 se<&en!e 0e nee% to !arry o&t a n&mber of %ifferent steps. First,
0e have to spit &p the se<&en!e into !o%ons. $hen, 0e nee% to go thro&gh ea!h
!o%on an% transate it into the !orrespon%ing amino a!i% resi%&e. Finay, 0e nee%
to !reate a protein se<&en!e by a%%ing a the amino a!i% resi%&es together.
,e. start off by fig&ring o&t ho0 to spit a D;4 se<&en!e into !o%ons. (e!a&se
this e1er!ise is <&ite tri!-y, 0e. pi!- a very short test D;4 se<&en!e to 0or- on 5
9&st three !o%ons:
dna B ,!+(++)((+,
7o0 are 0e going to spit &p the D;4 se<&en!e into gro&ps of three basesE 't.s
tempting to try to &se the split metho%, b&t remember that the split metho%
ony 0or-s if the things yo& 0ant to spit are separate% by a %eimiter. 'n o&r !ase,
there.s nothing separating the !o%ons, so split 0i not hep &s.
"omething that might be abe to hep &s is s&bstring notation. ,e -no0 that this
ao0s &s to e1tra!t part of a string, so 0e !an %o something i-e this:
dna B ,!+(++)((+,
codon" B dna'#:3*
codon2 B dna'3:%*
codon3 B dna'%:J*
print(codon" codon2 codon3&
4s 0e !an see from the o&tp&t, this 0or-s:
1I@ Chapter I: Di!tionaries
(3!+(3 3++)3 3((+3&
b&t it.s not a great so&tion, as 0e have to fi in the n&mbers man&ay. "in!e the
n&mbers foo0 a very pre%i!tabe pattern, it sho&% be possibe to generate them
a&tomati!ay. $he start position for ea!h s&bstring is initiay Gero, then goes &p by
three for ea!h s&!!essive !o%on. $he stop position is 9&st the start position p&s
three.
2e!a that the 9ob of the range f&n!tion is to generate se<&en!es of n&mbers. 'n
or%er to generate the se<&en!e of s&bstring start positions, 0e nee% to &se the
three/arg&ment version of range, 0here the first arg&ment is the n&mber to start
at, the se!on% arg&ment is the n&mber to finish at, an% the thir% arg&ment is the
step siGe. For o&r D;4 se<&en!e above, the n&mber to start at is Gero, an% the step
siGe is three. $he n&mber to finish at it not si1 b&t seven, be!a&se ranges are
e1!&sive at the finish. $his bit of !o%e sho0s ho0 0e !an &se the range f&n!tion
to generate the ist of start positions:
for start in range(#73&:
print(start&
#
3
%
$o fin% the stop position for a given start position 0e 9&st a%% three, so 0e !an
easiy spit o&r D;4 into !o%ons &sing a oop:
dna B ,!+(++)((+,
for start in range(#73&:
codon B dna'start:startC3*
print(,one codon is, C codon&
1@0 Chapter I: Di!tionaries
one codon is !+(
one codon is ++)
one codon is ((+
$his 0or-s fine for o&r test D;4 se<&en!e, b&t if 0e give it a shorter se<&en!e 0e
0i get in!ompete an% empty !o%ons:
dna B ,!+(++,
for start in range(#73&:
codon B dna'start:startC3*
print(codon&
one codon is !+(
one codon is ++
one codon is
an% if 0e give it a onger se<&en!e, 0e 0i miss o&t the fo&rth an% s&bse<&ent
!o%ons:
dna B ,!+(++)((+(!!()((()+!(!+,
for start in range(#73&:
codon B dna'start:startC3*
print(,one codon is , C codon&
one codon is !+(
one codon is ++)
one codon is ((+
Ceary 0e nee% to mo%ify the se!on% arg&ment to range 5 the position to finish
the se<&en!e of n&mbers 5 in or%er to ta-e into a!!o&nt the ength of the D;4
se<&en!e. 4t this point, 0e have to !onfront the probem of 0hat to %o if 0e.re
given a D;4 se<&en!e 0hose ength is not an e1a!t m&tipe of three. Ceary, 0e
!annot transate an in!ompete !o%on, so 0e 0ant the start position of the fina
1@1 Chapter I: Di!tionaries
!o%on to e<&a to the ength of the D;4 se<&en!e min&s t0o. $his g&arantees that
there 0i a0ays be t0o more !hara!ters foo0ing the position of the fina !o%on
start 5 i.e. eno&gh for a !ompete !o%on.
7ere.s the mo%ifie% !o%e:
dna B ,!+(++)((+,
4 calculate the start position for the final codon
last/codon/start B len(dna& X 2
4 process the dna seAuence in three base chun:s
for start in range(#last/codon/start3&:
codon B dna'start:startC3*
print(,one codon is , C codon&
;o0 that 0e -no0 ho0 to spit a D;4 se<&en!e &p into !o%ons, et.s t&rn o&r
attention to the probem of transating those !o%ons. 'f 0e p& &p the *2+ from
the e1er!ise %es!ription in a 0eb bro0ser, 0e !an see the stan%ar% !o%on
transation tabe in vario&s formats. "toring this transation tabe seems i-e a
perfe!t 9ob for a %i!tionary: 0e have !o%ons :-eys= an% amino a!i% resi%&es :va&es=
an% 0e 0ant to be abe to oo- &p the amino a!i% for a given !o%on.
1@2 Chapter I: Di!tionaries
7ere.s a bit of !o%e 5 it.s a!t&ay a singe statement, sprea% o&t over m&tipe ines
5 0hi!h !reates a %i!tionary to ho% the transation tabe:
gencode @ M
)!"!)0);)% )!"C)0);)% )!"")0);)% )!"#)0)&)%
)!C!)0)")% )!CC)0)")% )!C#)0)")% )!C")0)")%
)!!C)0)N)% )!!")0)N)% )!!!)0)S)% )!!#)0)S)%
)!#C)0)S)% )!#")0)S)% )!#!)0)Q)% )!##)0)Q)%
)C"!)0)T)% )C"C)0)T)% )C"#)0)T)% )C"")0)T)%
)CC!)0)P)% )CCC)0)P)% )CC#)0)P)% )CC")0)P)%
)C!C)0):)% )C!")0):)% )C!!)0)U)% )C!#)0)U)%
)C#!)0)Q)% )C#C)0)Q)% )C##)0)Q)% )C#")0)Q)%
)#"!)0)V)% )#"C)0)V)% )#"#)0)V)% )#"")0)V)%
)#C!)0)!)% )#CC)0)!)% )#C#)0)!)% )#C")0)!)%
)#!C)0)()% )#!")0)()% )#!!)0)E)% )#!#)0)E)%
)##!)0)#)% )##C)0)#)% )###)0)#)% )##")0)#)%
)"C!)0)S)% )"CC)0)S)% )"C#)0)S)% )"C")0)S)%
)""C)0)7)% )""")0)7)% )""!)0)T)% )""#)0)T)%
)"!C)0)')% )"!")0)')% )"!!)0)_)% )"!#)0)_)%
)"#C)0)C)% )"#")0)C)% )"#!)0)_)% )"##)0)R)N
,e !an oo- &p the amino a!i% for a given !o%on &sing either of the t0o metho%s
that 0e earne% abo&t:
print(gencode'3)!+3*&
print(gencode.get(3(+)3&&
1
S
1@3 Chapter I: Di!tionaries
'f 0e oo- &p the amino a!i% for ea!h !o%on insi%e the oop of o&r origina !o%e, 0e
!an print both the !o%on an% the amino a!i% transation
1
:
dna B ,!+(++)((+,
last/codon/start B len(dna& F 2
for start in range(#last/codon/start3&:
codon B dna'start:startC3*
aa B gencode.get(codon&
print(,one codon is , C codon&
print(,the amino acid is , C aa&
one codon is !+(
the amino acid is P
one codon is ++)
the amino acid is 7
one codon is ((+
the amino acid is (
$his is starting to oo- promising. $he fina step is to a!t&ay %o something 0ith
the amino a!i% resi%&es rather than 9&st printing them. 4 ni!e i%ea is to ta-e o&r
!&e from the 0ay that a ribosome behaves an% a%% ea!h ne0 amino a!i% resi%&e
onto the en% of a protein to !reate a gra%&ay/gro0ing string:
1 From no0 on, 0e 0on.t in!&%e the statement 0hi!h !reates the %i!tionary in o&r !o%e sampes as it ta-es
&p too m&!h room, so if yo& 0ant to try r&nning these yo&rsef yo&. nee% to a%% it ba!- at the top.
1@B Chapter I: Di!tionaries
dna B ,!+(++)((+,
last/codon/start B len(dna& F 2
protein B ,,
for start in range(#last/codon/start3&:
codon B dna'start:startC3*
aa B gencode.get(codon&
protein B protein C aa
print(,protein seAuence is , C protein&
'n the above !o%e, 0e !reate a ne0 variabe to ho% the protein se<&en!e
imme%iatey before 0e start the oop :ine 3=, then a%% a singe !hara!ter onto the
en% of that variabe ea!h time ro&n% the oop :ine H=. (y the time 0e e1it the oop,
0e have b&it &p the !ompete protein se<&en!e an% 0e !an print it o&t :ine I=:
protein seAuence is P7(
$his oo-s i-e a very &sef& bit of !o%e, so et.s t&rn it into a f&n!tion. ?&r f&n!tion
0i ta-e one arg&ment 5 the D;4 se<&en!e as a string 5 an% 0i ret&rn a string
!ontaining the protein se<&en!e
1
:
def translate/dna(dna&:
last/codon/start B len(dna& F 2
protein B ,,
for start in range(#last/codon/start3&:
codon B dna'start:startC3*
aa B gencode.get(codon&
protein B protein C aa
return protein
,e !an no0 test o&r f&n!tion by printing o&t the protein transation for a fe0 more
test se<&en!es:
1 3o&. noti!e that this f&n!tion reies on the gencode variabe 0hi!h is %efine% o&tsi%e the f&n!tion 5
something that ' to% yo& not to %o in !hapter C. $his is an e1!eption to the r&e: %efining the gen!o%e
variabe insi%e the f&n!tion means that it 0o&% have to be !reate% ane0 ea!h time 0e 0ante% to transate
a D;4 se<&en!e.
1
2
3
4
5
6
7
8
1@C Chapter I: Di!tionaries
print(translate/dna(,!+(++)((+,&&
print(translate/dna(,!+)(!+)(!+)(++()++!+)(!+)!(,&&
print(translate/dna(,actgatcgtagctagctgacgtatcgtat,&&
print(translate/dna(,!)(!+)(!+)(+<!)(+!)(!+)(+!)+)(,&&
$he o&tp&t from this !o%e sho0s that 0e r&n into a probem 0ith the thir%
se<&en!e:
P7(
L@QS??L@Y
+racebac: (most recent call last&:
7ile ,dna/translation.py, line 3# in ;module-
print(translate/dna(,actgatcgtagctagctgacgtatcgtat,&&
7ile ,dna/translation.py, line 2$ in translate/dna
protein B protein C aa
+ype9rror: cannot concatenate 3str3 and 3<one+ype3 obDects
$he probem o!!&rs 0hen 0e try to oo- &p the amino a!i% for the first !o%on of the
thir% se<&en!e 5 >a!t>. (e!a&se the thir% se<&en!e is in o0er !ase b&t the
transation tabe %i!tionary is in &pper !ase, the -ey isn.t fo&n%, the get metho%
ret&rns None, an% 0e get an error. Fi1ing it is straightfor0ar% 5 0e 9&st nee% to
!onvert the !o%on to &pper !ase before oo-ing &p the amino a!i%:
def translate/dna(dna&:
last/codon/start B len(dna& F 2
protein B ,,
for start in range(#last/codon/start3&:
codon B dna'start:startC3*
aa B gencode.get(codon.upper(&&
protein B protein C aa
return protein
;o0 the o&tp&t sho0s that the first three se<&en!es are fine, b&t that o&r f&n!tion
has a probem transating the fo&rth se<&en!e:
1@F Chapter I: Di!tionaries
P7(
L@QS??L@Y
+@QS??+RQ
+racebac: (most recent call last&:
7ile ,dna/translation.py, line 3" in ;module-
print(translate/dna(,!)(!+)(!+)(+<!)(+!)(!+)(+!)+)(,&&
7ile ,dna/translation.py, line 2$ in translate/dna
protein B protein C aa
+ype9rror: cannot concatenate 3str3 and 3<one+ype3 obDects
Ran!ing at the inp&t se<&en!es, it.s not !ear 0hat the probem is. +et.s try
printing the !o%ons as they.re transate% in or%er to i%entify the one that.s !a&sing
the error:
def translate/dna(dna&:
last/codon/start B len(dna& F 2
protein B ,,
for start in range(#last/codon/start3&:
codon B dna'start:startC3*
print(,about to translate codon: , C codon&
aa B gencode.get(codon.upper(&&
protein B protein C aa
return protein
print(translate/dna(,!)(!+)(!+)(+<!)(+!)(!+)(+!)+)(,&&
$he o&tp&t sho0s 0here the probem ies:
1@H Chapter I: Di!tionaries
about to translate codon: !)(
about to translate codon: !+)
about to translate codon: (!+
about to translate codon: )(+
about to translate codon: <!)
+racebac: (most recent call last&:
7ile ,dna/translation.py, line 32 in ;module-
print(translate/dna(,!)(!+)(!+)(+<!)(+!)(!+)(+!)+)(,&&
7ile ,dna/translation.py, line 2% in translate/dna
protein B protein C aa
+ype9rror: cannot concatenate 3str3 and 3<one+ype3 obDects
$here is an &n-no0n base in the mi%%e of the D;4 se<&en!e, 0hi!h !a&ses o&r
f&n!tion to try to oo- &p the amino a!i% for the !o%on N!C, 0hi!h !a&ses an error
be!a&se that !o%on isn.t in the %i!tionary. 7o0 sho&% 0e fi1 thisE ,e !o&% a%% an
if statement to the f&n!tion 0hi!h ony transates the D;4 se<&en!e if it %oesn.t
!ontain any &nambig&o&s bases, b&t that seems a itte too !onservative 5 there are
penty of sit&ations in 0hi!h 0e might 0ant to generate a protein se<&en!e for a
D;4 se<&en!e that has &n-no0n bases. ,e !o&% a%% an if statement insi%e the
oop 0hi!h ony transates a given !o%on if it %oesn.t !ontain any &nambig&o&s
bases, b&t that 0o&% ea% to protein transations of an in!orre!t ength 5 0e -no0
that the !o%on ;4C 0i transate to an amino a!i%, 0e 9&st %on.t -no0 0hi!h one it
0i be.
$he most sensibe so&tion seems to be to transate any !o%on 0ith an &n-no0n
base into the symbo for an &n-no0n amino a!i% resi%&e, 0hi!h is W. $he optiona
se!on% arg&ment to the get f&n!tion ma-es it very easy to %o 9&st that:
1@I Chapter I: Di!tionaries
def translate/dna(dna&:
last/codon/start B len(dna& F 2
protein B ,,
for start in range(#last/codon/start3&:
codon B dna'start:startC3*
aa B gencode.get(codon.upper(& 3Z3&
protein B protein C aa
return protein
an% no0 0e !an transate a fo&r of o&r test se<&en!es !orre!ty:
print(translate/dna(,!+(++)((+,&&
print(translate/dna(,!+)(!+)(!+)(++()++!+)(!+)!(,&&
print(translate/dna(,actgatcgtagcttgcttacgtatcgtat,&&
print(translate/dna(,!)(!+)(!+)(+<!)(+!)(!+)(+!)+)(,&&
P7(
L@QS??L@Y
+@QS??+RQ
+L@QZSQSRS
4t this point, it.s a goo% i%ea to t&rn these test se<&en!es into assert statements
5 that 0ay, 0e !an easiy re/test the f&n!tion if 0e ma-e some !hanges to it in the
f&t&re:
assert(translate/dna(,!+(++)((+,&& BB ,P7(,
assert(translate/dna(,!+)(!+)(!+)(++()++!+)(!+)!(,&& BB ,L@QS??L@Y,
assert(translate/dna(,actgatcgtagcttgcttacgtatcgtat,&& BB ,+@QS??+RQ,
assert(translate/dna(,!)(!+)(!+)(+<!)(+!)(!+)(+!)+)(,&& BB ,+L@QZSQSRS,
1@@ Chapter @: Fies, programs, an% &ser inp&t
%: )iles* programs* and user input
/ile contents an" manipulation
2ea%ing from an% 0riting to fies 0as one of the first things 0e oo-e% at in this
boo-, ba!- in !hapter 3. For some programs, ho0ever, 0e.re not 9&st !on!erne% 0ith
the !ontents of fies, b&t 0ith fies an% fo%ers themseves. $his is espe!iay i-ey
to be the !ase for programs that have to operate as part of a 0or- fo0 invoving
other toos an% soft0are. For e1ampe, 0e may nee% to !opy, move, rename an%
%eete fies, or 0e may nee% to pro!ess a fies in a !ertain fo%er.
4tho&gh it seems i-e a simpe tas- :after a, the fie manager toos that !ome
0ith yo&r operating system !an !arry most of them o&t=, fie manip&ation in a
ang&age i-e #ython is a!t&ay <&ite tri!-y. $hat.s be!a&se the !o%e that 0e 0rite
has to f&n!tion i%enti!ay on %ifferent operating systems 5 in!&%ing ,in%o0s,
+in&1 an% Ma! ma!hines 5 0hi!h may han%e fies <&ite %ifferenty. 4 %is!&ssion of
the %ifferen!es bet0een operating systems is 0ay beyon% the s!ope of this boo-,
b&t to give one e1ampe, *;'J/base% systems i-e +in&1 an% ?"J have the !on!ept
of fie permissions 0hi!h is a!-ing in ,in%o0s.
$han-f&y, #ython in!&%es a !o&pe of mo%&es
1
that ta-e !are of these %ifferen!es
for &s an% provi%e &s 0ith a set of &sef& f&n!tions for manip&ating fies. $he
mo%&es. names are os :short for ?perating "ystem= an% shutil :short for "7e
*$'+ities=. 'n the ne1t se!tion 0e. see ho0 they !an be &se% to !arry o&t vario&s
!ommon :b&t important= tas-s.
5 note on t,e code examples
"in!e the !o%e e1ampes in this !hapter &navoi%aby invove intera!tion 0ith the
operating system, some of the %etais 0i be operating/system spe!ifi!. 'n
1 $a-e a oo- ba!- at !hapter H for a remin%er of ho0 mo%&es 0or-.
200 Chapter @: Fies, programs, an% &ser inp&t
parti!&ar, many of the fie manip&ation f&n!tions ta-e paths as arg&ments, 0hi!h
%iffer !onsi%eraby bet0een operating systems. 4 path is the short bit of te1t that
tes yo& the o!ation of a fie in the fie system. ?n +in&1 an% ?"J ma!hines, the
path to a fie or fo%er typi!ay oo-s i-e this:
/path/to/my/file.txt
0hereas on ,in%o0s ma!hines, they oo- i-e this:
c:.path.to.my.file.txt
Moreover, the s&!!ess of the !o%e e1ampes for many f&n!tions reies on the fies
an% fo%ers a!t&ay being present on the !omp&ter on 0hi!h the e1ampes are r&n.
$he !o%e e1ampes in this !hapter 0i &se +in&1/stye paths, an% 0i refer to
fo%ers an% fies on my !omp&ter, so if yo& 0ant to try r&nning them, yo&.
probaby nee% to !hange the paths to refer to fies on yo&r o0n !omp&ter.
8asic file manipulation
$o rename an e1isting fie, 0e simpy import the os mo%&e, then &se the
os.rename f&n!tion. $he os.rename f&n!tion ta-es t0o arg&ments, both strings.
$he first is the !&rrent name of the fie, the se!on% is the ne0 name:
import os
os.rename(,old.txt, ,ne2.txt,&
$he above !o%e ass&mes that the fie ol"=t!t is in the fo%er 0here 0e are r&nning
o&r #ython program. 'f it.s ese0here in the fiesystem, then 0e have to give the
!ompete path:
os.rename(,/home/martin/biology/old.txt, ,/home/martin/biolgy/ne2.txt,&
201 Chapter @: Fies, programs, an% &ser inp&t
'f 0e spe!ify a %ifferent fo%er, b&t the same fie name, in the se!on% arg&ment,
then the f&n!tion 0i move the fie from one fo%er to another:
os.rename(,/home/martin/biology/old.txt, ,/home/martin/python/old.txt,&
?f !o&rse, 0e !an move an% rename a fie in one step if yo& i-e:
os.rename(,/home/martin/biology/old.txt, ,/home/martin/python/ne2.txt,&
os.rename 0or-s on fo%ers as 0e as fies:
os.rename(,/home/martin/old/folder, ,/home/martin/ne2/folder,&
'f 0e try to move a fie to a fo%er that %oesn.t e1ist 0e. get an error. ,e nee% to
!reate the ne0 fo%er first 0ith the os.mkdir f&n!tion:
os.m:dir(,/home/martin/python,&
'f 0e nee% to !reate a b&n!h of %ire!tories a in one go, 0e !an &se the os.mkdirs
f&n!tion :note the s on the en% of the name=:
os.m:dir(,/a/long/path/2ith/lots/of/folders,&
$o !opy a fie or fo%er 0e &se the shutil mo%&e. ,e !an !opy a singe fie 0ith
shutil.copy:
shutil.copy(,/home/martin/original.txt, ,/home/martin/copy.txt,&
or a fo%er 0ith shutil.copytree:
202 Chapter @: Fies, programs, an% &ser inp&t
shutil.copytree(,/home/martin/original/folder,
,/home/martin/copy/folder,&
$o test 0hether a fie or fo%er e1ists, &se os.path.exists:
if os.path.exists(,/home/martin/email.txt,&:
print(,Rou ha0e mail5,&
1eleting files an" fol"ers
$here are %ifferent f&n!tions for %eeting fies, empty fo%ers, an% non/empty
fo%ers. $o %eete a singe fie, &se os.remoCe:
os.remo0e(,/home/martin/un2anted/file.txt,&
$o %eete an empty fo%er, &se os.rmdir:
os.rmdir(,/home/martin/emtpy,&
$o %eete a fo%er an% a the fies in it, &se shutil.rmtree
shutil.rmtree(,home/martin/full,&
6isting fol"er contents
$he os.listdir f&n!tion ret&rns a ist of fies an% fo%ers. 't ta-es a singe
arg&ment 0hi!h is a string !ontaining the path of the fo%er 0hose !ontents yo&
0ant to sear!h. $o get a ist of the !ontents of the !&rrent 0or-ing %ire!tory, &se
the string >.> for the path:
203 Chapter @: Fies, programs, an% &ser inp&t
for file/name in os.listdir(,.,&:
print(,one file name is , C file/name&
$o ist the !ontents of a %ifferent fo%er, 0e 9&st give the path as an arg&ment:
for file/name in os.listdir(,/home/martin,&:
print(,one file name is , C file/name&
(unning e!ternal programs
4nother feat&re of #ython that invoves intera!tion 0ith the operating system is
the abiity to r&n e1terna programs. J&st i-e fie an% fo%er manip&ation, the
abiity to r&n other programs is very &sef& 0hen &sing #ython as part of a 0or-
fo0. 't ao0s &s to &se e1isting toos that 0o&% be very time/!ons&ming to
re!reate in #ython, or that 0o&% r&n very so0y.
2&nning e1terna programs from 0ithin yo&r #ython !o%e !an be a tri!-y b&siness,
an% this feat&re 0o&%n.t normay be !overe% in an intro%&!tory programming
!o&rse. 7o0ever, it.s so &sef& for bioogy :an% s!ien!e in genera= that 0e.re going
to !over it here, abeit in a simpifie% form.
4s 0ith the above se!tion on fie operations, the e1a!t %etais of ho0 e1terna
programs are r&n 0i vary 0ith yo&r operating system an% the 0ay yo&r !omp&ter
is set &p. ?n *;'J/base% systems, the program that yo& 0ant to r&n might area%y
be in yo&r path, in 0hi!h !ase yo& !an simpy &se the name of the e1e!&tabe as the
string to be e1e!&te%. For the e1ampe !o%e beo0, '. give the f& path to
e1e!&tabes on my !omp&ter, 0hi!h oo- something i-e this:
/home/martin/soft2are/myprogram
'f yo&.re on ,in%o0s, yo&r paths 0i probaby oo- i-e this:
20B Chapter @: Fies, programs, an% &ser inp&t
c:.2indo2s.Program files.myprogram.myprogram.exe
4n% on ?"J, they 0i oo- i-e this:
/!pplications/myprogram
4s before, if yo& 0ant to try r&nning any of these e1ampes, ma-e s&re that yo&
!hange the paths to point to rea e1e!&tabes on yo&r !omp&ter.
(unning a program
$he f&n!tions for r&nning e1terna program resi%e in the su$process mo%&e.
$he reasoning behin% the name is sighty !onvo&te%: 0hen ta-ing abo&t
operating systems, a r&nning program is !ae% a process, an% a program that is
starte% by another program is !ae% a subprocess.
$o r&n an e1terna program, &se the su$process.call f&n!tion. $his f&n!tion
ta-es a singe string arg&ment !ontaining the path to the e1e!&tabe yo& 0ant to
r&n:
import subprocess
subprocess.call(,/bin/date,&
4ny o&tp&t that is pro%&!e% by the e1terna program is printe% straight to the
s!reen 5 in this !ase, the o&tp&t from the +in&1 date program:
7ri Hul 2% "$:"$:2% NS+ 2#"3
'f 0e 0ant to s&ppy !omman%/ine options to the e1terna program then 0e 9&st
in!&%e them in the string, an% set the optiona shell arg&ment to "rue. 7ere 0e
!a the +in&1 date program 0ith the options 0hi!h !a&se it to 9&st print the
month:
20C Chapter @: Fies, programs, an% &ser inp&t
subprocess.call(,/bin/date C[N, shellB+rue&
Huly
&aving program output
?ften, 0e 0ant to r&n some e1terna program an% then store the o&tp&t in a
variabe so that 0e !an %o something &sef& 0ith it. For this, 0e &se
su$process.check_output, 0hi!h ta-es e1a!ty the same arg&ments as
su$process.call:
current/month B subprocess.chec:/output(,/bin/date C[N, shellB+rue&
J&st i-e 0hen rea%ing fie !ontents, the o&tp&t from an e1terna program !an r&n
over m&tipes ines that en% 0ith ne0 ine !hara!ters, so yo& probaby nee% to &se
rstrip to remove them before !arrying o&t any pro!essing.
,ser input makes our programs more fle!ible
$he e1er!ises an% e1ampes that 0e.ve seen so far in this boo- have &se% t0o
%ifferent 0ays of getting %ate into a program. For sma bits of %ata, i-e short D;4
se<&en!es, restri!tion enGyme motifs, an% gene a!!ession names, 0e.ve simpy
store% the %ata %ire!ty in a variabe i-e this:
dna B ,!+)(!+)(+(!)+!()+!)(,
,hen %ata is mi1e% in 0ith the !o%e in this manner, it is sai% to be har"Bco"e".
For arger pie!es of %ata, i-e onger D;4 se<&en!es an% sprea%sheet/i-e %ata,
0e.ve typi!ay rea% the information from an e1terna te1t fie. For many p&rposes,
this is a better so&tion than har%/!o%ing the %ata, as it ao0s the separation of
20F Chapter @: Fies, programs, an% &ser inp&t
%ata an% !o%e, ma-ing o&r programs easier to rea%. 7o0ever, in a the e1ampes
0e.ve seen so far, the names of the fies from 0hi!h the %ata are rea% are sti har%/
!o%e%.
(oth of these approa!hes to getting %ata in to o&r program have the same
short!omings 5 if 0e 0ant to !hange the inp&t %ata, 0e have to open &p the !o%e
an% e%it it. 'n the !ase of har%/!o%e% variabes, 0e have to e%it the statement 0here
the variabes are !reate%. 'n the !ase of fies, 0e have t0o !hoi!es 5 0e !an either
e%it the !ontents of the fie, or e%it the har%/!o%e% fie name.
2ea/ife &sef& programs %on.t generay 0or- that 0ay. 'nstea%, they generay
ao0 &s to spe!ify inp&t fies an% options at the time 0hen 0e r&n the program,
rather than 0hen 0e.re 0riting it. $his ao0s programs to be m&!h more fe1ibe
an% easier to &se, espe!iay for a person 0ho %i%n.t 0rite the !o%e in the first pa!e.
'n the ne1t !o&pe of se!tions 0e.re going to see a !o&pe of toos for getting &ser
inp&t, b&t more importanty 0e.re going to ta- abo&t the transition from 0riting a
program that.s ony &sef& to yo&, to 0riting one that !an be &se% by other peope.
$his invoves starting to thin- abo&t the e1perien!e of &sing a program from the
perspe!tive of a &ser.
$here are many reasons 0hy yo& might nee% yo&r programs to be &sabe by
somebo%y 0ho.s not famiiar 0ith the !o%e. 'f yo& 0rite a program that soves a
probem for yo&, !han!es are that it !o&% sove a probem for yo&r !oeag&es an%
!oaborators as 0e. 'f yo& 0rite a program that forms a signifi!ant part of a pie!e
of 0or- 0hi!h yo& ater 0ant to p&bish, yo& many have to ma-e s&re that 0hoever
is peer/revie0ing yo&r paper !an get yo&r program 0or-ing as 0e. ?f !o&rse,
ma-ing yo&r program easier to &se for other peope means that it 0i aso be easier
to &se for yo&, a fe0 months after yo& have 0ritten it 0hen yo& have !ompetey
forgotten ho0 the !o%e 0or-s8
20H Chapter @: Fies, programs, an% &ser inp&t
5nteractive user input
$o get intera!tive inp&t from the &ser in o&r programs, 0e !an &se the input
f&n!tion :in #ython 2, this f&n!tion is -no0n as input_ra-=. input ta-es a singe
string arg&ment, 0hi!h is the prompt to be %ispaye% to the &ser, an% ret&rns the
va&e type% in as a string:
accession B input(,9nter the accession name,&
4 do something 2ith the accession 0ariable
$he input f&n!tion behaves a itte %ifferenty to other f&n!tions an% metho%s
0e.ve seen, be!a&se it has to 0ait for something to happen before it !an ret&rn a
va&e 5 the &ser has to type in a string an% press enter. $he &ser inp&t 0i be
ret&rne% as a string :so if 0e nee% to &se is as something ese 5 e.g. a n&mber 5 0e.
have to %o the !onversion man&ay= an% 0i en% 0ith a ne0 ine :so 0e might
0ant to &se rstrip to remove it=.
Capt&ring &ser inp&t in this 0ay re<&ires &s to thin- <&ite !aref&y abo&t ho0 o&r
program behaves. #rograms that 0e 0rite to !arry o&t anaysis of arge %atasets 0i
often ta-e a !onsi%erabe amo&nt of time to r&n, so it.s important that 0e minimiGe
the !han!es of the &ser having to re/r&n them. ,hen &sing the input f&n!tion,
there are t0o sit&ations in parti!&ar that 0e 0ant to avoi%.
?ne is the sit&ation 0here 0e have a ong/r&nning program that re<&ires some
&ser inp&t, b&t %oesn.t ma-e this fa!t !ear to the &ser. ,hat !an happen in this
s!enario is that the &ser starts the program r&nning an% then s0it!hes their
attention to something ese, ass&ming that the program 0i !ontin&e to ma-e
progress in the ba!-gro&n%. 'f the &ser %oesn.t noti!e :or is not at their !omp&ter=
0hen the program rea!hes the point 0here it re<&ires inp&t an% hats, the program
may be st&!- 0aiting for inp&t for a ong time.
$he other s!enario to avoi% is that 0here a program r&ns for some time before
as-ing the &ser for inp&t, then fais to 0or- %&e to an in!orre!t inp&t or typo,
20I Chapter @: Fies, programs, an% &ser inp&t
re<&iring the &ser to re/start the program from s!rat!h.
4 goo% 0ay to avoi% both of these probems is to %esign o&r programs s&!h that
they !oe!t a ne!essary &ser inp&t at the start, before any ong/r&nning tas-s are
!arrie% o&t. ,e !an aso re%&!e the !han!es of in!orre!t inp&t on the part of the
&ser by offering !ear instr&!tions an% %o!&mentation.
4n important part of &ser inp&t is input vali"ation 5 !he!-ing that the inp&t
s&ppie% by the &ser ma-es sense. For e1ampe, yo& might re<&ire that a parti!&ar
inp&t is a n&mber bet0een some minim&m an% ma1im&m va&es, or that it.s a D;4
se<&en!e 0itho&t ambig&o&s bases, or that it.s the name of a fie that m&st e1ist. 4
goo% strategy for inp&t vai%ation is to !he!- the inp&t as soon as it.s re!eive%, an%
give the &ser a se!on% !han!e to enter their inp&t if it.s fo&n% to be invai%. ,e !an
han%e vai%ation of &ser inp&t &sing toos that 0e.ve area%y !overe% 5 oops an%
!on%itions 5 b&t a better 0ay to %o it is &sing e1!eptions. "ee the !hapter on
e1!eptions in A"vance" Python for 8iologists for e1ampes.
?ne big %ra0ba!- of getting &ser inp&t intera!tivey is that it ma-es it har%er to
r&n a program &ns&pervise% as part of a 0or- fo0. For most bioogi!a anayses,
spe!ifying program options 0hen it.s r&n &sing !omman% ine arg&ments is a better
approa!h.
4omman" line arguments
'f yo&.re &se% to &sing e1isting programs that have a !omman%/ine &ser interfa!e
:as oppose% to a graphi!a one= then yo&.re probaby famiiar 0ith !omman% ine
arg&ments
1
. $hese are the strings that yo& type on the !omman% ine after the
name of a program yo& 0ant to r&n:
myprogram one t2o three
1 ;ot to be !onf&se% 0ith the arg&ments that 0e give to f&n!tions, atho&gh they %o a simiar 9ob.
20@ Chapter @: Fies, programs, an% &ser inp&t
'n the above !o%e, one t0o an% three are the !omman% ine options. $o &se
!omman% ine arg&ments in o&r #ython s!ripts, 0e import the sys mo%&e. ,e !an
then a!!ess the !omman% ine arg&ments by &sing the spe!ia ist sys.argC.
2&nning the foo0ing !o%e:
import sys
print(sys.arg0&
0ith the !omman% ine:
python myprogram.py one t2o three
sho0s ho0 the eements of sys.argC are ma%e &p of the arg&ments given on the
!omman% ine:
'3myprogram.py3 3one3 3t2o3 3three3*
;ote that the first eement of sys.argC is a0ays the name of the program itsef,
so the first !omman% ine arg&ment is at in%e1 one, the se!on% at in%e1 t0o, et!.
J&st i-e 0ith input, options an% fienames given on the !omman% ine are store%
as strings, so if, for e1ampe, 0e 0ant to &se a !omman% ine arg&ment as a n&mber,
0e. have to !onvert it 0ith int.
Comman% ine arg&ments are a goo% 0ay of getting inp&t for yo&r #ython
programs for a n&mber of reasons. 4 the %ata yo&r program nee%s 0i be present
at the start of yo&r program, so yo& !an %o any ne!essary inp&t vai%ation :i-e
!he!-ing that fies are present= before starting any pro!essing. 4so, yo&r program
0i be abe to be r&n as part of a she s!ript, an% the options 0i appear in the
&ser.s she history.
210 Chapter @: Fies, programs, an% &ser inp&t
(ecap
,e starte% this !hapter by e1amining t0o feat&res of #ython that ao0 yo&r
programs to intera!t 0ith the operating system 5 fie manip&ation an% e1terna
pro!esses. ,e earne% 0hi!h f&n!tions to &se for !ommon fie system operations,
an% 0hi!h mo%&es they beong to. ,e aso ssa0 t0o 0ays to !a e1terna programs
from 0ithin yo&r #ython program.
,hen &sing these te!hni<&es to sove rea ife probems, or 0hen 0or-ing on the
e1er!ises, remember that yo& may en!o&nter errors that are nothing to %o 0ith
yo&r program. For instan!e, 0hen trying to manip&ate fies yo& may get an error if
a spe!ifie% fie %oesn.t e1ist or yo& %on.t have the ne!essary permissions to rename
it. "imiary, if yo& get &ne1pe!te% o&tp&t 0hen r&nning an e1terna program the
probem may ie 0ith the e1terna program or 0ith the 0ay that yo&.re !aing it,
rather than 0ith yo&r #ython program. $his is in !ontrast to the rest of the
e1er!ises in this boo-, 0hi!h are mosty sef/!ontaine%. 'f yo& r&n into %iffi!&ties
0hen &sing the toos in this !hapter, !he!- the e1terna fa!tors as 0e as !he!-ing
yo&r program !o%e.
'n the ast portion of the !hapter, 0e sa0 t0o %ifferent 0ays to get &ser inp&t 0hen
yo&r program r&ns. *sing !omman% ine arg&ments is generay better for the type
of programming that forms part of s!ientifi! resear!h.
211 Chapter @: Fies, programs, an% &ser inp&t
!ercises
'n the chapter?. fo%er in the e1er!ises %o0noa% there is a !oe!tion of fies 0ith
the e1tension ="na 0hi!h !ontain D;4 se<&en!es of varying ength, one per ine.
*se this set of fies for both e1er!ises.
?inning (/5 se2uences
,rite a program 0hi!h !reates nine ne0 fo%ers 5 one for se<&en!es bet0een 100
an% 1@@ bases ong, one for se<&en!es bet0een 200 an% 2@@ bases ong, et!. ,rite
o&t ea!h D;4 se<&en!e in the inp&t fies to a separate fie in the appropriate fo%er.
=mer counting
,rite a program that 0i !a!&ate the n&mber of a -mers of a given ength a!ross
a D;4 se<&en!es in the inp&t fies an% %ispay 9&st the ones that o!!&r more than
a given n&mber of times. 3o& program sho&% ta-e t0o !omman% ine arg&ments 5
the -mer ength, an% the !&toff n&mber.
212 Chapter @: Fies, programs, an% &ser inp&t
&olutions
?inning (/5 se2uences
$he first 9ob is to fig&re o&t ho0 to rea% a the D;4 se<&en!es. ,e !an get a ist of
a the fies in the fo%er by &sing os.listdir, b&t 0e. have to be !aref& to ony
rea% D;4 se<&en!es from fies that have the right fie name e1tension. 7ere.s a bit
of !o%e to start off 0ith:
import os

for file/name in os.listdir(,.,&:
if file/name.ends2ith(,.dna,&:
print(,reading seAuences from , C file/name&
,e !an !he!- the o&tp&t to ma-e s&re that 0e.re ony going to pro!ess the !orre!t
fies:
reading seAuences from xag.dna
reading seAuences from xaD.dna
reading seAuences from xaa.dna
reading seAuences from xab.dna
reading seAuences from xai.dna
reading seAuences from xae.dna
reading seAuences from xah.dna
reading seAuences from xaf.dna
reading seAuences from xac.dna
reading seAuences from xad.dna
$he ne1t step is to rea% the D;4 se<&en!es from ea!h fie. For ea!h fie that passes
the name test, 0e. open it, then pro!ess it one ine at a time an% !a!&ate the
ength of the D;4 se<&en!e:
213 Chapter @: Fies, programs, an% &ser inp&t
4 loo: at each file
for file/name in os.listdir(,.,&:
if file/name.ends2ith(,.dna,&:
print(,reading seAuences from , C file/name&
dna/file B open(file/name&
4 loo: at each line
for line in dna/file:
dna B line.rstrip(,.n,&
length B len(dna&
print(,found a dna seAuence 2ith length , C str(length&&
;oti!e ho0 0e.ve &se% rstrip to remove the ne0 ine !hara!ter 5 0e %on.t 0ant to
in!&%e it in the !o&nt of the se<&en!e ength, sin!e it.s not a base. ,ith ten fies,
an% ten D;4 se<&en!es per fie, this program generates over a h&n%re% ines of
o&tp&t 5 here.s the first fe0:
reading seAuences from xag.dna
found a dna seAuence 2ith length 432
found a dna seAuence 2ith length G"G
found a dna seAuence 2ith length %#4
found a dna seAuence 2ith length G7J
found a dna seAuence 2ith length %"J
found a dna seAuence 2ith length $##
found a dna seAuence 2ith length ""J
found a dna seAuence 2ith length 34"
found a dna seAuence 2ith length 3#3
found a dna seAuence 2ith length 4%J
reading seAuences from xaD.dna
found a dna seAuence 2ith length "2"
found a dna seAuence 2ith length 442
found a dna seAuence 2ith length $2#
$his oo-s goo% 5 0e.re getting a range of %ifferent siGes. ;e1t 0e have to fig&re o&t
0hi!h bin ea!h of the se<&en!es sho&% go in. (e!a&se the imits of the bins foo0
a reg&ar pattern, 0e !an &se the range f&n!tion to generate them. ,e !an
generate a ist of the o0er imits for ea!h bin by ta-ing a range of n&mbers from
21B Chapter @: Fies, programs, an% &ser inp&t
100 to 1000 0ith a step siGe of 100, then a%%ing @@ to get the &pper imit of the bin.
,e. go thro&gh this pro!ess for ea!h se<&en!e, !he!-ing if it beongs in ea!h bin
in t&rn:
4 go through each file in the folder
for file/name in os.listdir(,.,&:
4 chec: if it ends 2ith .dna
if file/name.ends2ith(,.dna,&:
print(,reading seAuences from , C file/name&
4 open the file and process each line
dna/file B open(file/name&
for line in dna/file:
4 calculate the seAuence length
dna B line.rstrip(,.n,&
length B len(dna&
print(,seAuence length is , C str(length&&
4 go through each bin and chec: if the seAuence belongs in it
for bin/lo2er in range("##"###"##&:
bin/upper B bin/lo2er C JJ
if length -B bin/lo2er and length ; bin/upper:
print(,bin is , C str(bin/lo2er& C , to , C str(bin/upper&&
$here are <&ite a fe0 eves of in%entation in the above !o%e, so yo& might have to
rea% it thro&gh a fe0 times. ,e have
• the oop for ea!h fie name
• the if statement that !he!-s the fie name
• the oop for ea!h se<&en!e in a fie
• the oop for ea!h bin
• the if statement that !he!-s if the se<&en!e beongs in the bin
21C Chapter @: Fies, programs, an% &ser inp&t
$he first fe0 ines of the o&tp&t sho0 that this approa!h 0or-s:
reading seAuences from xag.dna
seAuence length is 432
bin is 4## to 4JJ
seAuence length is G"G
bin is G## to GJJ
seAuence length is %#4
bin is %## to %JJ
seAuence length is G7J
bin is G## to GJJ
seAuence length is %"J
bin is %## to %JJ
seAuence length is $##
bin is $## to $JJ
seAuence length is ""J
$he fina step is to !reate the ne0 fo%ers, an% 0rite ea!h D;4 se<&en!e to the
appropriate one. ,e !an re/&se o&r range i%ea to generate the fo%er names an%
!reate them. $he name of the fo%er for a given bin is the o0er imit, foo0e% by
an &n%ers!ore, foo0e% by the &pper imit:
for bin/lo2er in range("##"###"##&:
bin/upper B bin/lo2er C JJ
bin/folder/name B str(bin/lo2er& C ,/, C str(bin/upper&
os.m:dir(bin/folder/name&
,hen 0e 0ant to 0rite o&t D;4 se<&en!e to a fie in a parti!&ar fo%er, 0e !an &se
the same naming s!heme to 0or- o&t the name of the fo%er. ?f !o&rse, 0e aso
have to fig&re o&t 0hat to !a the in%ivi%&a fies of D;4 se<&en!es. $he e1er!ise
%es!ription %i%n.t spe!ify any -in% of naming s!heme, so 0e. -eep things simpe
an% store the first D;4 se<&en!e in a fie !ae% 1="na, the se!on% in a fie !ae%
2="na, et!. ,e. nee% to !reate an e1tra variabe to ho% the n&mber of D;4
se<&en!es 0e.ve seen, an% to in!rement it after 0riting ea!h D;4 se<&en!e. 7ere.s
the 0hoe s!ript 5 it.s by far the argest program that 0e.ve 0ritten so far:
21F Chapter @: Fies, programs, an% &ser inp&t
import os

4 create a ne2 folder for each bin
for bin/lo2er in range("##"###"##&:
bin/upper B bin/lo2er C JJ
bin/folder/name B str(bin/lo2er& C ,/, C str(bin/upper&
os.m:dir(bin/folder/name&

4 create a 0ariable to hold the seAuence number
seA/number B "
4 process all files that end in .dna
for file/name in os.listdir(,.,&:
if file/name.ends2ith(,.dna,&:
print(,reading seAuences from , C file/name&
dna/file B open(file/name&
4 for each line calculate the seAuence length
for line in dna/file:
dna B line.rstrip(,.n,&
length B len(dna&
print(,seAuence length is , C str(length&&
4 figure out 2hich bin the seAuence belongs in
for bin/lo2er in range("##"###"##&:
bin/upper B bin/lo2er C JJ
if length -B bin/lo2er and length ; bin/upper:
4 once 2e :no2 the correct bin 2rite out the seAuence
print(,bin is , C str(bin/lo2er& C , to , C str(bin/upper&&
bin/folder/name B str(bin/lo2er& C ,/, C str(bin/upper&
output/path B bin/folder/name C 3/3 C str(seA/number& C 3.dna3
output B open(output/path ,2,&
output.2rite(dna&
output.close(&
4 increment the seAuence number
seA/number B seA/numberC"
21H Chapter @: Fies, programs, an% &ser inp&t
=mer counting
$o !ome &p 0ith a pan of atta!- for this e1er!ise, 0e m&st first thin- abo&t the
or%er in 0hi!h 0e pro!ess the %ata. Can 0e simpy rea% a singe D;4 se<&en!e,
!o&nt the -/mers, an% print the !o&nts i-e 0e %i% for the trin&!eoti%e e1ampe in
!hapter IE ;o, be!a&se 0e ony 0ant to print the -/mers 0hi!h o!!&r more than a
given n&mber of times a!ross all se<&en!es. 'n other 0or%s, 0e %on.t -no0 0hi!h
-/mers 0e 0ant to print the !o&nts for &nti 0e have finishe% pro!essing a
se<&en!es.
"o, 0e 0i have to ta!-e this probem in t0o stages. First, 0e 0i go thro&gh ea!h
se<&en!e one/by/one an% gra%&ay b&i% &p a ist of -/mer !o&nts. "e!on%, 0e 0i
go thro&gh the ist of !o&nts an% print ony the ones 0hose !o&nt is above the
!&toff.
7o0 0i 0e generate the -/mer !o&ntsE 4 goo% first step 0o&% be to fig&re o&t
ho0 to spit a D;4 se<&en!e into overapping -/mers of any given ength. ,e !an
&se a simiar approa!h to the one ta-en in the D;4 transation e1er!ise in !hapter
I: &se the range f&n!tion to generate a ist of the start positions of ea!h -/mer,
then &se s&bstring notation to e1tra!t the -/mer from the se<&en!e. 7ere.s a bit of
!o%e that prints a -/mers of a given siGe. ,e. &se a short test D;4 se<&en!e for
no0:
test/dna B ,!)+(+!()+(+!)(+!(),
print(test/dna&
:mer/siEe B 4
for start in range(#len(test/dna&F(:mer/siEeF"&"&:
:mer B test/dna'start:startC:mer/siEe*
print(:mer&
$he tri!-y bit is fig&ring o&t the arg&ments to the range f&n!tion. ,e -no0 that 0e
0ant to start at Gero an% in!rease by one ea!h time. $he finish position is the
ength of the se<&en!e, min&s the -/mer siGe :to ma-e s&re there is one -/mer.s
21I Chapter @: Fies, programs, an% &ser inp&t
0orth of bases after it= min&s one :to ao0 for the fa!t that the finish position is
e1!&sive=. $he range f&n!tion generates the start positions for ea!h -/mer, an% to
get the en% positions 0e 9&st a%% the -/mer siGe. ,e !an e1amine the o&tp&t from
this !o%e an% !he!- that it agrees 0ith o&t int&ition:
!)+(+!()+(+!)(+!()
!)+(
)+(+
+(+!
(+!(
+!()
!()+
()+(
)+(+
+(+!
(+!)
+!)(
!)(+
)(+!
(+!(
+!()
$o ma-e it easier to test this bit of !o%e, 0e. t&rn it into a f&n!tion. $he f&n!tion
0i ta-e t0o arg&ments. $he first arg&ment 0i be the D;4 se<&en!e as a string,
an% the se!on% arg&ment 0i be the -/mer siGe as a n&mber. 'nstea% of printing the
ist of -/mers, it 0i ret&rn a ist of them. 7ere.s the !o%e for the f&n!tion an% three
statements to test it:
21@ Chapter @: Fies, programs, an% &ser inp&t
def split/dna(dna :mer/siEe&:
:mers B '*
for start in range(#len(dna&F(:mer/siEeF"&"&:
:mer B dna'start:startC:mer/siEe*
:mers.append(:mer&
return :mers


print(split/dna(,!!+()+()!+, 4&&
print(split/dna(,!!+()+()!+, $&&
print(split/dna(,!!+()+()!+, %&&
4s 0e !an see from the o&tp&t, r&nning the f&n!tion m&tipe times 0ith the same
D;4 se<&en!e b&t %ifferent -/mer engths gives %ifferent res&ts, as e1pe!te%:
'3!!+(3 3!+()3 3+()+3 3()+(3 3)+()3 3+()!3 3()!+3*
'3!!+()3 3!+()+3 3+()+(3 3()+()3 3)+()!3 3+()!+3*
'3!!+()+3 3!+()+(3 3+()+()3 3()+()!3 3)+()!+3*
;o0 0e !an p&t this f&n!tion together 0ith the !o%e 0e %eveope% for ooping
thro&gh fies from the previo&s e1er!ise. $o !o&nt &p the -/mers, 0e 0i !reate an
empty %i!tionary at the start of the program :ine 11=, then for ea!h -/mer 0e fin%,
0e 0i oo- &p the !&rrent !o&nt for it in the %i!tionary :ine 1@=. 'f the -/mer is
not fo&n% in the %i!tionary :i.e. this is the first time 0e.ve seen that parti!&ar -/
mer= then 0e 0i say that the !&rrent !o&nt is Gero. ,e. then a%% one to the
!&rrent !o&nt :ine 20= an% store the res&t ba!- in the %i!tionary :ine 21=.
220 Chapter @: Fies, programs, an% &ser inp&t
import os
:mer/siEe B %

def split/dna(dna :mer/siEe&:
:mers B '*
for start in range(#len(dna&F(:mer/siEeF"&"&:
:mer B dna'start:startC:mer/siEe*
:mers.append(:mer&
return :mers

:mer/counts B UV
for file/name in os.listdir(,.,&:
if file/name.ends2ith(,.dna,&:
print(,reading seAuences from , C file/name&
dna/file B open(file/name&
for line in dna/file:
dna B line.rstrip(,.n,&
for :mer in split/dna(dna :mer/siEe&:
current/count B :mer/counts.get(:mer #&
ne2/count B current/count C "
:mer/counts':mer* B ne2/count

print(:mer/counts&
$his program generates a ot of o&tp&t8 7ere are the first fe0 ines so 0e !an see
that it.s 0or-ing:
U3gcagag3: "" 3aaataa3: "3 3ctttag3: "" 3gcagac3: "4 3ctttaa3: "2
3gcagaa3: "$ etc. etc.
4s panne%, 0e en% &p 0ith a big %i!tionary 0here the -eys are -mers an% the
va&es are their !o&nts.
;e1t, 0e have to pro!ess the kmer_counts %i!tionary. ,e. go thro&gh the items
in a oop, an% if the !o&nt is greater than some !&toff, 0e. print the !o&nt. For
testing, 0e. fi1 the !&toff at 23 :ater on 0e. ma-e this a !omman%/ine option=.
7ere.s the !o%e to pro!ess the %i!tionary:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
221 Chapter @: Fies, programs, an% &ser inp&t
count/cutoff B 23
for :mer count in :mer/counts.items(&:
if count - count/cutoff:
print(:mer C , : , C str(count&&
4n% here.s the o&tp&t 0e get:
agagat : 2%
agcggg : 2%
atcgga : 2$
aaggag : 2$
cccagc : 24
aggttc : 2$
agatta : 24
tctagg : 24
gagtgg : 2G
ccggtt : 2%
gagcag : 24
ttctga : 2%
agatgg : 24
tctgaa : 24
gcgggt : 2$
ttcaaa : 2$
gattaa : 2$
ccagcg : 2$
ggacgt : 27
atggct : 24
;eary %one. $he fina step is to repa!e the har%/!o%e% va&es for the -/mer siGe
an% the !o&nt !&toff 0ith va&es rea% from the !omman% ine. ,e 9&st have to
import the sys mo%&e, an% !onvert the arg&ments to n&mbers &sing the int
f&n!tion. 4s spe!ifie% in the e1er!ise %es!ription, the first !omman% ine arg&ment
is the -/mer siGe an% the se!on% is the !&toff. 7ere.s the fina !o%e 0ith !omments:
222 Chapter @: Fies, programs, an% &ser inp&t
import os
import sys

4 con0ert command line arguments to 0ariables
:mer/siEe B int(sys.arg0'"*&
count/cutoff B int(sys.arg0'2*&

4 define the function to split dna
def split/dna(dna :mer/siEe&:
:mers B '*
for start in range(#len(dna&F(:mer/siEeF"&"&:
:mer B dna'start:startC:mer/siEe*
:mers.append(:mer&
return :mers

4 create an empty dictionary to hold the counts
:mer/counts B UV

4 process each file 2ith the right name
for file/name in os.listdir(,.,&:
if file/name.ends2ith(,.dna,&:
dna/file B open(file/name&

4 process each @<! seAuence in a file
for line in dna/file:
dna B line.rstrip(,.n,&

4 increase the count for each :Fmer that 2e find
for :mer in split/dna(dna :mer/siEe&:
current/count B :mer/counts.get(:mer #&
ne2/count B current/count C "
:mer/counts':mer* B ne2/count

4 print :Fmers 2hose counts are abo0e the cutoff
for :mer count in :mer/counts.items(&:
if count - count/cutoff:
print(:mer C , : , C str(count&&
223 Chapter @: Fies, programs, an% &ser inp&t
;o0 0e !an spe!ify the -/mer ength on the !omman% ine 0hen 0e r&n the
program. ,ith a -/mer ength of F an% a !&toff of 2C:
python :mer/counting.py % 2$
2e get the output
agagat : 2%
agcggg : 2%
gagtgg : 2G
ccggtt : 2%
ttctga : 2%
ggacgt : 27
,ith a -/mer ength of 3 an% a !&toff of @00:
python :mer/counting.py 3 J##
tct : J#G
ttc : J24
gtt : J#$
gat : J#4
gga : J"#
atc : J#$