PERL AND BIOPERL

CONTROL STRUCTURES

"if" statement - first style

    if ($porridge_temp < 40) {
        print "too cold.\n";
    } elsif ($porridge_temp > 150) {
        print "too hot.\n";
    } else {
        print "just right\n";
    }

CONTROL STRUCTURES

"if" statement - second style

    statement if condition;
    print "\$index is $index" if $DEBUG;

- Single statements only
- Simple expressions only

"unless" is a reverse "if"

    statement unless condition;
    print "millennium is here!" unless $year < 2000;

CONTROL STRUCTURES

"for" loop - first style

    for (initial; condition; increment) { code }

    for ($i=0; $i<10; $i++) {
        print "hello\n";
    }

"for" loop - second style

    for [variable] (range) { code }

    for $name (@employees) {
        print "$name is an employee.\n";
    }

THE FOR STATEMENT

Syntax

    for (START; STOP; ACTION) { BODY }

- Initially execute the START statements once.
- Repeatedly execute BODY as long as the STOP condition is true (that is, until it becomes false).
- Execute ACTION after each iteration.

Example

    for ($i=0; $i<10; $i++) {
        print("Iteration: $i\n");
    }

THE FOREACH STATEMENT

Syntax

    foreach SCALAR ( ARRAY ) { BODY }

- Assign ARRAY element to SCALAR.
- Execute BODY.
- Repeat for each element in ARRAY.

Example

    @asTmp = qw(One Two Three);
    foreach $s (@asTmp) { $s .= "sy "; }
    print(@asTmp);   # Onesy Twosy Threesy
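The foreach example above works because the loop variable is an alias for each array element, so modifying it modifies the array. A minimal sketch of that behaviour, and of copying with map when the original list must stay intact (the variable names are illustrative, not from the slides):

```perl
use strict;
use warnings;

my @names = qw(One Two Three);

# The loop variable is an alias: changing $s changes the array element.
foreach my $s (@names) {
    $s .= "sy";
}

# Build a modified copy instead when the original must not change.
my @original = qw(One Two Three);
my @copy     = map { $_ . "sy" } @original;

print "@names\n";      # modified in place
print "@original\n";   # left untouched
print "@copy\n";
```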

CONTROL STRUCTURES

"while" loop

    while (condition) { code }

    $cars = 7;
    while ($cars > 0) {
        print "cars left: ", $cars--, "\n";
    }

    while ($game_not_over) { ... }

CONTROL STRUCTURES

"until" loop is the opposite of "while"

    until (condition) { code }

    $cars = 7;
    until ($cars <= 0) {
        print "cars left: ", $cars--, "\n";
    }

    while ($game_not_over) { ... }

CONTROL STRUCTURES

Bottom-check Loops

- do { code } while (condition);
- do { code } until (condition);

    $value = 0;
    do {
        print "Enter Value: ";
        $value = <STDIN>;
    } until ($value > 0);

SUBROUTINES (FUNCTIONS)

Defining a Subroutine

- sub name { code }
- Arguments passed in via "@_" list

    sub multiply {
        my ($a, $b) = @_;
        return $a * $b;
    }

- Last value processed is the return value (could have left out the word "return", above)
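The point about the implicit return value can be checked with two versions of the same subroutine. This is a small sketch; the name multiply_implicit is made up for illustration, and $x and $y are used instead of $a and $b only to avoid shadowing the variables that sort uses.

```perl
use strict;
use warnings;

# Explicit return.
sub multiply {
    my ($x, $y) = @_;
    return $x * $y;
}

# Implicit return: the value of the last evaluated expression.
sub multiply_implicit {
    my ($x, $y) = @_;
    $x * $y;
}

my $explicit = multiply(6, 7);
my $implicit = multiply_implicit(6, 7);
print "$explicit $implicit\n";
```

An explicit return is usually easier to read, especially when a subroutine has more than one exit point.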

SUBROUTINES (FUNCTIONS)

Calling a Subroutine

- &subname;   # no args, no return value
- &subname (args);
- $retval = &subname (args);
- The "&" is optional so long as...
    - subname is not a reserved word
    - the subroutine was defined before being called

SUBROUTINES (FUNCTIONS)

Passing Arguments

- Arguments arrive as one flat list in @_; copy them into "my" variables to work on private values
- Lists are expanded

    @a = (5,10,15);
    @b = (20,25);
    &mysub(@a,@b);

- this passes five arguments: 5,10,15,20,25
- mysub can receive them as 5 scalars, or one array
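Because lists are flattened, a subroutine cannot tell where one array ended and the next began. A common way around this is to pass array references. The sketch below contrasts the two calling styles; count_args and sum_pair are hypothetical helper names, not part of the slides.

```perl
use strict;
use warnings;

# Flattened call: the sub only sees one combined list.
sub count_args { return scalar @_; }

# Reference call: each array arrives as a single scalar reference.
sub sum_pair {
    my ($first, $second) = @_;
    my $total = 0;
    $total += $_ for @$first, @$second;
    return (scalar @$first, scalar @$second, $total);
}

my @a = (5, 10, 15);
my @b = (20, 25);

my $flat = count_args(@a, @b);                       # five separate arguments
my ($len_a, $len_b, $total) = sum_pair(\@a, \@b);    # two references
print "flat=$flat len_a=$len_a len_b=$len_b total=$total\n";
```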

SUBROUTINES (FUNCTIONS)

Examples

    sub good1 { my($a,$b,$c) = @_; }
    &good1 (@triplet);

    sub good2 { my(@a) = @_; }
    &good2 ($one, $two, $three);

DEALING WITH HASHES

- keys( ) - get an array of all keys

      foreach (keys (%hash)) { ... }

- values( ) - get an array of all values

      @array = values (%hash);

- each( ) - get key/value pairs

      while (@pair = each(%hash)) {
          print "element $pair[0] has $pair[1]\n";
      }

DEALING WITH HASHES

- exists( ) - check if element exists

      if (exists $ARRAY{$key}) { ... }

- delete( ) - delete one element

      delete $ARRAY{$key};
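The five hash functions above can be seen working together in a short, self-contained sketch that counts codons. The data and variable names are invented for illustration.

```perl
use strict;
use warnings;

my %count;
$count{$_}++ for qw(ATG TAA ATG GGC ATG TAA);

# keys: hash order is not predictable, so sort for stable output.
my @report = map { "$_=$count{$_}" } sort keys %count;

# values: add up all the counts.
my $total = 0;
$total += $_ for values %count;

# each: walk the key/value pairs one at a time.
my $pairs = 0;
while (my ($codon, $n) = each %count) { $pairs++; }

# exists and delete: test for a key, remove it, test again.
my $had_taa = exists $count{TAA} ? 1 : 0;
delete $count{TAA};
my $has_taa = exists $count{TAA} ? 1 : 0;

print "@report | total=$total pairs=$pairs | TAA before=$had_taa after=$has_taa\n";
```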

OTHER USEFUL FUNCTIONS
- push( ), pop( ) - stack operations on lists
- shift( ), unshift( ) - bottom-based ops
- split( ) - split a string by separator

      @parts = split(/:/, $passwd_line);
      while (split) ...   # like: split (/\s+/, $_)

- splice( ) - remove/replace elements
- substr( ) - substrings of a string
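A short sketch of the list functions on one array may help to show which end each of them works on. The sample line is an invented passwd-style record.

```perl
use strict;
use warnings;

my @parts = split /:/, "root:x:0:0:root:/root:/bin/sh";

my $shell = pop @parts;      # remove from the end
my $user  = shift @parts;    # remove from the front
push    @parts, "tail";      # add to the end
unshift @parts, "head";      # add to the front

# splice(ARRAY, OFFSET, LENGTH, LIST): replace 2 elements starting at index 1.
my @removed = splice(@parts, 1, 2, "X");

print "$user $shell | @parts | @removed\n";
```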

STRING MANIPULATION
- chop
- chop(VARIABLE)
- chop(LIST)
- index(STR, SUBSTR, POSITION)
- index(STR, SUBSTR)
- length(EXPR)

STRING MANIPULATION (CONT.)
- substr(EXPR, OFFSET, LENGTH)
- substr(EXPR, OFFSET)

Example: string.pl
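The file string.pl is not reproduced in the slides, so the following is only a stand-in sketch of the functions listed above, using an invented DNA string. Note that all positions are zero-based.

```perl
use strict;
use warnings;

my $str = "ATGGGTACC";

my $len     = length($str);                 # number of characters
my $pos     = index($str, "GG");            # first occurrence, zero-based
my $next    = index($str, "G", $pos + 2);   # search again from a later position
my $missing = index($str, "N");             # -1 when not found
my $mid     = substr($str, 3, 3);           # 3 characters starting at offset 3
my $tail    = substr($str, 6);              # from offset 6 to the end

my $line = "GGT\n";
chop $line;                                 # removes the last character, whatever it is

print "$len $pos $next $missing $mid $tail $line\n";
```

chop removes the last character unconditionally; chomp removes it only if it is a line ending, which is usually what is wanted when reading input.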

PATTERN MATCHING

See if strings match a certain pattern

- syntax: string =~ pattern
- Returns true if it matches, false if not.
- Example: match "abc" anywhere in string:

      if ($str =~ /abc/) { ... }

But what about complex concepts like:

- between 3 and 5 numeric digits
- optional whitespace at beginning of line

PATTERN MATCHING

Regular expressions are a way to describe character patterns in a string.

- Example: match "john" or "jon"

      /joh?n/

- Example: match money values

      /\$\d+\.\d\d/

- Complex example: match times of the day

      /\d?\d:\d\d(:\d\d)? (AM|PM)?/i

PATTERN MATCHING 

Symbols with Special Meanings

- period .           - any single character
- char set [0-9a-f]  - one char matching these
- Abbreviations
    - \d - a numeric digit [0-9]
    - \w - a word character [A-Za-z0-9_]
    - \s - whitespace char [ \t\n\r\f]
    - \D, \W, \S - any character but \d, \w, \s
    - \n, \r, \t - newline, carriage-return, tab
    - \f, \e - formfeed, escape
    - \b - word break

PATTERN MATCHING 

Symbols with Special Meanings

- asterisk *       - zero or more occurrences
- plus sign +      - one or more occurrences
- question mark ?  - zero or one occurrences
- caret ^          - anchor to beginning of line
- dollar sign $    - anchor to end of line
- quantity {n,m}   - between n and m occurrences (inclusive)

[A-Z]{2,4} means "2, 3, or 4 uppercase letters".
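The symbols above are enough to express the patterns from the earlier slides. The sketch below tests a few of them against sample strings; the helper matches and the sample strings are illustrative, the patterns are anchored with ^ and $ so that the whole string must match, and (?: ) is a non-capturing group that the slides do not cover.

```perl
use strict;
use warnings;

# Returns 1 if the string matches the pattern, 0 otherwise.
sub matches {
    my ($str, $re) = @_;
    return $str =~ $re ? 1 : 0;
}

my $name_ok   = matches('jon',    qr/^joh?n$/);         # optional "h"
my $name_bad  = matches('johhn',  qr/^joh?n$/);         # two "h" is too many
my $money_ok  = matches('$19.99', qr/^\$\d+\.\d\d$/);   # dollar sign must be escaped
my $digits_ok = matches('  1234', qr/^\s*\d{3,5}$/);    # optional whitespace, 3 to 5 digits
my $code_bad  = matches('ABCDE',  qr/^[A-Z]{2,4}$/);    # five letters exceed {2,4}

# Capture groups pull out the parts of a time of day.
my ($hour, $min) = '9:05 pm' =~ /^(\d?\d):(\d\d)(?::\d\d)? (?:AM|PM)?$/i;

print "$name_ok $name_bad $money_ok $digits_ok $code_bad $hour:$min\n";
```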

PATTERN MATCHING

Ways of Using Patterns

- Matching

      if ($line =~ /pattern/) { ... }

  also written: m/pattern/

- Substitution

      $name =~ s/ASU/Arizona State University/;

- Translation

      $command =~ tr/A-Z/a-z/;   # lowercase it
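Two details of substitution and translation are easy to miss: s/// replaces only the first match unless the g modifier is given, and tr/// returns the number of characters it matched, which makes it handy for counting. A small sketch with invented sample strings:

```perl
use strict;
use warnings;

my $name = "ASU and ASU";

# Copy, then substitute on the copy.
(my $first = $name) =~ s/ASU/Arizona State University/;    # first match only
(my $all   = $name) =~ s/ASU/Arizona State University/g;   # every match

my $command = "LS -L";
$command =~ tr/A-Z/a-z/;        # lowercase it

# With an empty replacement list, tr counts without changing the string.
my $dna = "ATGGGTA";
my $gc  = ($dna =~ tr/GC//);

print "$first\n$all\n$command\n$gc\n";
```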

COMMAND LINE ARGS
- $0 = program name
- @ARGV = array of arguments to the program
- zero-based index (default for all arrays)

Example

    yourprog -a somefile

- $0 is "yourprog"
- $ARGV[0] is "-a"
- $ARGV[1] is "somefile"

BASIC FILE I/O

Reading a File

    open (FILEHANDLE, "$filename") ||
        die "open of $filename failed: $!";
    while (<FILEHANDLE>) {
        chomp $_;   # or just: chomp; (chop would remove the last character even if it is not a newline)
        print "$_\n";
    }
    close FILEHANDLE;

BASIC FILE I/O

Writing a File

    open (FILEHANDLE, ">$filename") ||
        die "open of $filename failed: $!";
    foreach (@data) {
        print FILEHANDLE "$_\n";   # note, no comma!
    }
    close FILEHANDLE;

BASIC FILE I/O

Predefined File Handles

- STDIN   - input
- STDOUT  - output
- STDERR  - output

      print STDERR "big bad error occurred\n";

- <>      - ARGV or STDIN

READING WITH <>

Reading from File
- $input = <MYFILE>;

Reading from Command Line (files named in @ARGV)
- $input = <>;

Reading from Standard Input
- $input = <>;
- $input = <STDIN>;

READING WITH <> (CONT.)

Reading into Array Variable
- @an_array = <MYFILE>;
- @an_array = <STDIN>;
- @an_array = <>;
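The difference between reading into a scalar and reading into an array is a matter of context: a scalar read returns one line, a list read returns all remaining lines. The sketch below shows this with an in-memory filehandle (a core Perl feature, used here only so that the example needs no external file); the sample text is invented.

```perl
use strict;
use warnings;

my $text = "line one\nline two\nline three\n";

# Open a filehandle on a string instead of a file on disk.
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";

# Scalar context: one line per read.
my $first = <$fh>;
chomp $first;

# List context: all remaining lines at once.
my @rest = <$fh>;
chomp @rest;
close $fh;

print "first: $first\n";
print "remaining lines: ", scalar(@rest), "\n";
```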

PACKAGES
- Collect data & functions in a separate ("private") namespace
- Reusable code

PACKAGES

Access packages by file name or path:

- require "getopts.pl";
- require "/usr/local/lib/perl/getopts.pl";
- require "../lib/mypkg.pl";

PACKAGES
- Command: package pkgname;
- Stays in effect until next "package" or end of block { ... } or end of file.
- Default package is "main"

PACKAGES

- Package name in variables

      $pkg::counter = 0;

- Package name in subroutines

      sub pkg::mysub ( ) { ... }
      &pkg::mysub($stuff);

- Old syntax in Perl 4

      sub pkg'mysub ( ) { ... }

PACKAGES
    #
    # Get Day Of Month Package
    #
    package getDay;

    sub main::getDayOfMonth {
        my ($sec, $min, $hour, $mday) = localtime;
        return $mday;
    }

    1;   # otherwise "require" or "use" would fail

PACKAGES

- Calling the package

      require "/path/to/getDay.pl";
      $day = &getDayOfMonth;

- In Perl 5, you can leave off "&" for previously defined functions:

      $day = getDayOfMonth;

WHAT ARE PERL MODULES?
- Modules are collections of subroutines
- Encapsulate code for a related set of processes
- End in .pm, so Foo.pm would be used as Foo
- Can form the basis for objects in object-oriented programming

USING A SIMPLE MODULE
- List::Util is a set of list utility functions
- Read the perldoc to see what you can do
- Follow the synopsis or individual function examples

LIST::UTIL
    use List::Util;
    my @list = 10..20;
    my $sum = List::Util::sum(@list);
    print "sum (@list) is $sum\n";

    use List::Util qw(shuffle sum);
    my @list = (10,10,12,11,17,89);
    my $sum = sum(@list);
    print "sum (@list) is $sum\n";
    my @shuff = shuffle(@list);
    print "shuff is @shuff\n";

MODULE NAMING
- Module naming helps identify the purpose of the module
- The symbol :: separates the parts of a module name, and these parts map directly to a directory structure
- List::Util is therefore a module file called Util.pm located in a directory called "List"

(MORE) MODULE NAMING
- Does not require inheritance or any specific relationship between modules that start with the same directory name
- Case MaTTerS! List::util will not work
- Read more about a module by running "perldoc Modulename"
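The mapping from module name to file path can be checked from inside a script. Perl records every loaded module in the %INC hash, keyed by the relative path it searched for. In this sketch module_to_path is a hypothetical helper written for illustration; List::Util is used because it ships with Perl.

```perl
use strict;
use warnings;
use List::Util ();   # load the module without importing anything

# Turn a module name into the relative file path Perl looks for.
sub module_to_path {
    my ($module) = @_;
    (my $path = $module) =~ s{::}{/}g;
    return "$path.pm";
}

my $relative    = module_to_path('List::Util');   # List/Util.pm
my $loaded_from = $INC{$relative};                # full path of the file that was loaded

print "$relative was loaded from $loaded_from\n";
```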

MODULES AS OBJECTS
- Modules are collections of subroutines
- Can also manage data (aka state)
- Multiple instances can be created (instantiated)
- Can access module routines directly on the object

OBJECT CREATION
- To instantiate a module, call "new"
- Sometimes there are initialization values
- Objects are registered for cleanup when they are set to undefined (or when they go out of scope)
- Methods are called using -> because we are dereferencing the object
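The cleanup behaviour described above can be observed by giving a class a DESTROY method, which Perl calls when the last reference to an object disappears. The class Tracker below is invented purely for this sketch; it records creation and destruction events in a list.

```perl
use strict;
use warnings;

package Tracker;     # hypothetical class, for illustration only

my @log;             # shared record of what happened

sub new {
    my ($class, $name) = @_;
    push @log, "new $name";
    return bless { name => $name }, $class;
}

sub name        { my $self = shift; return $self->{name}; }
sub DESTROY     { my $self = shift; push @log, "destroy $self->{name}"; }
sub log_entries { return @log; }

package main;

{
    my $t = Tracker->new('a');
    print $t->name, "\n";
}                           # $t goes out of scope here, so DESTROY runs

my $u = Tracker->new('b');
undef $u;                   # setting to undefined also triggers DESTROY

print join(", ", Tracker::log_entries()), "\n";
```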

SIMPLE MODULE AS OBJECT EXAMPLE
    #!/usr/bin/perl -w
    use strict;
    use MyAdder;

    my $adder = new MyAdder;
    $adder->add(10);
    print $adder->value, "\n";
    $adder->add(10);
    print $adder->value, "\n";

    my $adder2 = new MyAdder(12);
    $adder2->add(17);
    print $adder2->value, "\n";

    my $adder3 = MyAdder->new(75);
    $adder3->add(7);
    print $adder3->value, "\n";

WRITING A MODULE: INSTANTIATION

- Starts with package to define the module name
    - multiple packages can be defined in a single module file, but this is not recommended at this stage
- The method name new is usually used for instantiation
    - bless is used to associate a data structure with a package, which turns it into an object

WRITING A MODULE: SUBROUTINES

- When a module's subroutine is called as a method, the first argument is always a reference to the object; we usually call it "$self" in the code.
- This is an implicit aspect of object-oriented Perl.
- Write subroutines just like normal, but data associated with the object can be accessed through the $self reference.

WRITING A MODULE
    package MyAdder;
    use strict;

    sub new {
        my ($package, $val) = @_;
        $val ||= 0;
        my $obj = bless { 'value' => $val }, $package;
        return $obj;
    }

    sub add {
        my ($self, $val) = @_;
        $self->{'value'} += $val;
    }

    sub value {
        my $self = shift;
        return $self->{'value'};
    }

    1;   # a module file must end with a true value

WRITING A MODULE II (ARRAY)
    package MyAdder;
    use strict;

    sub new {
        my ($package, $val) = @_;
        $val ||= 0;
        my $obj = bless [$val], $package;
        return $obj;
    }

    sub add {
        my ($self, $val) = @_;
        $self->[0] += $val;
    }

    sub value {
        my $self = shift;
        return $self->[0];
    }

    1;   # a module file must end with a true value

USING THE MODULE
- Perl has to know where to find the module
- Uses a set of include paths
    - type perl -V and look at the @INC variable
- Can also add to this path with the PERL5LIB environment variable
- Can also specify an additional library path in the script

      use lib '/path/to/lib';

USING A MODULE AS AN OBJECT
- LWP is a Perl library for WWW processing
- Will initialize an "agent" to go out and retrieve web pages for you
- Can be used to process the content that it downloads

LWP::USERAGENT
    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;

    my $url = 'http://us.expasy.org/uniprot/P42003.txt';
    my $ua = LWP::UserAgent->new();   # initialize an object
    $ua->timeout(10);                 # set the timeout value
    my $response = $ua->get($url);

    if ($response->is_success) {
        # print $response->content;   # or whatever
        if( $response->content =~ /DE\s+(.+)\n/ ) {
            print "description is '$1'\n";
        }
        if( $response->content =~ /OS\s+(.+)\n/ ) {
            print "species is '$1'\n";
        }
    } else {
        die $response->status_line;
    }

OVERVIEW OF BIOPERL TOOLKIT

Bioperl is...

- A set of Perl modules for manipulating genomic and other biological data
- An open source toolkit with many contributors
- A flexible and extensible system for doing bioinformatics data manipulation

SOME THINGS YOU CAN DO
- Read in sequence data from a file in standard formats (FASTA, GenBank, EMBL, SwissProt, ...)
- Manipulate sequences: reverse complement, translate coding DNA sequence to protein
- Parse a BLAST report, get access to every bit of data in the report
- Dr. Mikler will post some detailed tutorials

MAJOR DOMAINS COVERED
- Sequences, Features, Annotations
- Pairwise alignment reports
- Multiple Sequence Alignments
- Bibliographic data
- Graphical Rendering of sequence tracks
- Database for features and sequences

ADDITIONAL DOMAINS
- Gene prediction parsers
- Trees, Parsing Phylogenetic and Molecular Evolution software output
- Population Genetic data and summary statistics
- Taxonomy
- Protein Structure

SEQUENCE FILE FORMATS

Simple formats - without features
- FASTA (Pearson), Raw, GCG

Rich formats - with features and annotations
- GenBank, EMBL
- Swissprot, GenPept
- XML - BSML, GAME, AGAVE, TIGRXML, CHADO

PARSING SEQUENCES

Bio::SeqIO
- multiple drivers: genbank, embl, fasta, ...

Sequence objects
- Bio::PrimarySeq
- Bio::Seq
- Bio::Seq::RichSeq

LOOK AT THE SEQUENCE OBJECT

Common (Bio::PrimarySeq) methods
- seq() - get the sequence as a string
- length() - get the sequence length
- subseq($s,$e) - get a subsequence
- translate(...) - translate to protein [DNA]
- revcom() - reverse complement [DNA]
- display_id() - identifier string
- description() - description string
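To make clear what subseq and revcom compute, here is a plain-Perl sketch of the same two operations on a string. It does not use BioPerl and is not BioPerl's implementation; the function names subseq_plain and revcom_plain are invented, and the coordinates are taken to be 1-based and inclusive, as in the subseq($s,$e) usage shown in these slides.

```perl
use strict;
use warnings;

# Subsequence with 1-based, inclusive start and end coordinates.
sub subseq_plain {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

# Reverse complement: reverse the string, then swap each base for its partner.
sub revcom_plain {
    my ($seq) = @_;
    my $rc = reverse $seq;            # scalar context reverses the characters
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my $dna    = "ATGGGTA";
my $sub    = subseq_plain($dna, 4, 5);
my $revcom = revcom_plain($dna);

print "bases 4-5: $sub\n";
print "reverse complement: $revcom\n";
```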

DETAILED LOOK AT SEQS WITH ANNOTATIONS

Bio::Seq objects have the methods
- add_SeqFeature($feature) - attach feature(s)
- get_SeqFeatures() - get all the attached features
- species() - a Bio::Species object
- annotation() - Bio::Annotation::Collection

FEATURES
- Bio::SeqFeatureI - interface
- Bio::SeqFeature::Generic - basic implementation
- SeqFeature::Similarity - some score info
- SeqFeature::FeaturePair - pair of features

SEQUENCE FEATURES

Bio::SeqFeatureI - interface - GFF derived
- start(), end(), strand() for location information
- location() - Bio::LocationI object (to represent complex locations)
- score, frame, primary_tag, source_tag - feature information
- spliced_seq() - for attached sequence, get the sequence spliced

SEQUENCE FEATURE (CONT.)

Bio::SeqFeature::Generic
- add_tag_value($tag,$value) - add a tag/value pair
- get_tag_values($tag) - get all the values for this tag
- has_tag($tag) - test if a tag exists
- get_all_tags() - get all the tags

ANNOTATIONS
- Each Bio::Seq has a Bio::Annotation::Collection via $seq->annotation()
- Annotations are stored with keys like 'comment' and 'reference'

      @com = $annotation->get_Annotations('comment');
      $annotation->add_Annotation('comment', $an);

ANNOTATIONS

- Annotation::Comment
    - comment field
- Annotation::Reference
    - author, journal, title, etc.
- Annotation::DBLink
    - database, primary_id, optional_id, comment
- Annotation::SimpleValue

CREATE A SEQUENCE OUT OF THIN AIR
    use Bio::Seq;
    my $seq = Bio::Seq->new(-seq         => 'ATGGGTA',
                            -display_id  => 'MySeq',
                            -description => 'a description');
    print "bases 4-5 are ", $seq->subseq(4,5), "\n";
    print "my whole sequence is ", $seq->seq(), "\n";
    print "reverse complement is ", $seq->revcom->seq(), "\n";

READING IN A SEQUENCE
    use Bio::SeqIO;
    my $in = Bio::SeqIO->new(-format => 'genbank',
                             -file   => 'file.gb');
    while( my $seq = $in->next_seq ) {
        print "sequence name is ", $seq->display_id,
              " length is ", $seq->length, "\n";
        print "there are ", (scalar $seq->get_SeqFeatures),
              " features attached to this sequence and ",
              scalar $seq->annotation->get_Annotations('reference'),
              " reference annotations\n";
    }

WRITING A SEQUENCE
    use Bio::SeqIO;
    # Let's convert swissprot to fasta format
    my $in  = Bio::SeqIO->new(-format => 'swiss',
                              -file   => 'file.sp');
    my $out = Bio::SeqIO->new(-format => 'fasta',
                              -file   => '>file.fa');
    while( my $seq = $in->next_seq ) {
        $out->write_seq($seq);
    }

A DETAILED LOOK AT BLAST PARSING

3 Components
- Result: Bio::Search::Result::ResultI
- Hit: Bio::Search::Hit::HitI
- HSP: Bio::Search::HSP::HSPI

BLAST PARSING SCRIPT
    use Bio::SearchIO;
    my $cutoff = '0.001';
    my $file = 'BOSS_Ce.BLASTP';
    my $in = new Bio::SearchIO(-format => 'blast',
                               -file   => $file);
    while( my $r = $in->next_result ) {
        print "Query is: ", $r->query_name, " ",
              $r->query_description, " ", $r->query_length, " aa\n";
        print " Matrix was ", $r->get_parameter('matrix'), "\n";
        while( my $h = $r->next_hit ) {
            last if $h->significance > $cutoff;
            print "Hit is ", $h->name, "\n";
            while( my $hsp = $h->next_hsp ) {
                print " HSP Len is ", $hsp->length('total'), " ",
                      " E-value is ", $hsp->evalue,
                      " Bit score ", $hsp->bits, " \n",
                      " Query loc: ", $hsp->query->start, " ", $hsp->query->end, " ",
                      " Sbjct loc: ", $hsp->hit->start, " ", $hsp->hit->end, "\n";
            }
        }
    }

BLAST Report
    Copyright (C) 1996-2000 Washington University, Saint Louis, Missouri USA.
    All Rights Reserved.

    Reference:  Gish, W. (1996-2000) http://blast.wustl.edu

    Query=  BOSS_DROME Bride of sevenless protein precursor.
            (896 letters)

    Database:  wormpep87
               20,881 sequences; 9,238,759 total letters.
    Searching....10....20....30....40....50....60....70....80....90....100% done

                                                                         Smallest
                                                                           Sum
                                                                  High  Probability
    Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

    F35H10.10 CE24945 status:Partially_confirmed TR:Q20073...      182  4.9e-11   1
    M02H5.2 CE25951 status:Predicted TR:Q966H5 protein_id:...       86  0.15      1
    ZC506.4 CE01682 locus:mgl-1 metatrophic glutamate recept...     91  0.18      1
    ...

USING THE SEARCH::RESULT OBJECT
    use Bio::SearchIO;
    use strict;
    my $parser = new Bio::SearchIO(-format => 'blast',
                                   -file   => 'file.bls');
    while( my $result = $parser->next_result ) {
        print "query name=", $result->query_name,
              " desc=", $result->query_description,
              ", len=", $result->query_length, "\n";
        print "algorithm=", $result->algorithm, "\n";
        print "db name=", $result->database_name,
              " #lets=", $result->database_letters,
              " #seqs=", $result->database_entries, "\n";
        print "available params ",
              join(',', $result->available_parameters), "\n";
        print "available stats ",
              join(',', $result->available_statistics), "\n";
        print "num of hits ", $result->num_hits, "\n";
    }

USING THE SEARCH::HIT OBJECT
    use Bio::SearchIO;
    use strict;
    my $parser = new Bio::SearchIO(-format => 'blast',
                                   -file   => 'file.bls');
    while( my $result = $parser->next_result ) {
        while( my $hit = $result->next_hit ) {
            print "hit name=", $hit->name,
                  " desc=", $hit->description,
                  "\n len=", $hit->length,
                  " acc=", $hit->accession, "\n";
            print "raw score ", $hit->raw_score,
                  " bits ", $hit->bits,
                  " significance/evalue=", $hit->evalue, "\n";
        }
    }

TURNING BLAST INTO HTML
    use Bio::SearchIO;
    use Bio::SearchIO::Writer::HTMLResultWriter;

    my $in = new Bio::SearchIO(-format => 'blast',
                               -file   => shift @ARGV);
    my $writer = new Bio::SearchIO::Writer::HTMLResultWriter();
    my $out = new Bio::SearchIO(-writer => $writer,
                                -file   => ">file.html");
    $out->write_result($in->next_result);

TURNING BLAST INTO HTML
    # to filter your output
    my $MinLength = 100;   # need a variable with scope outside the method

    sub hsp_filter {
        my $hsp = shift;
        return 1 if $hsp->length('total') > $MinLength;
    }

    sub result_filter {
        my $result = shift;
        return $result->num_hits > 0;
    }

    my $writer = new Bio::SearchIO::Writer::HTMLResultWriter
                     (-filters => { 'HSP' => \&hsp_filter } );
    my $out = new Bio::SearchIO(-writer => $writer);
    $out->write_result($in->next_result);

    # can also set the filter via the writer object
    $writer->filter('RESULT', \&result_filter);

CUSTOM URL LINKS

    @args = ( -nucleotide_url => $gbrowsedblink,
              -protein_url    => $gbrowsedblink );
    my $processor = new Bio::SearchIO::Writer::HTMLResultWriter(@args);
    $processor->introduction(\&intro_with_overview);
    $processor->hit_link_desc(\&gbrowse_link_desc);
    $processor->hit_link_align(\&gbrowse_link_desc);

    sub intro_with_overview {
        my ($result) = @_;
        my $f = &generate_overview($result, $result->{"_FILEBASE"});
        $result->rewind();
        return sprintf( qq{
            <center>
            <b>Hit Overview<br>
            Score: <font color="red">Red= (&gt;=200)</font>,
            <font color="purple">Purple 200-80</font>,
            <font color="green">Green 80-50</font>,
            <font color="blue">Blue 50-40</font>,
            <font color="black">Black &lt;40</font>
            ...

MULTIPLE SEQUENCE ALIGNMENTS
- Bio::AlignIO to read alignment files
- Produces Bio::SimpleAlign objects
- Interface and objects designed for round-tripping and some functional work
- Could really use an overhaul or a parallel MSA representation

GETTING SEQUENCES FROM GENBANK
- Through the web interface: Bio::DB::GenBank (don't abuse!!)
- Alternative is to download all of GenBank and index it with Bio::DB::Flat (will be much faster in the long run)

SIMPLE SEQUENCE RETRIEVAL
    use Bio::Perl;
    my $seq = get_sequence('genbank', $acc);
    print "I got a sequence $seq for $acc\n";

SEQUENCE RETRIEVAL SCRIPT
    #!/usr/bin/perl -w
    use strict;
    use Bio::DB::GenPept;
    use Bio::DB::GenBank;
    use Bio::SeqIO;

    my $db = new Bio::DB::GenPept();
    # my $db = new Bio::DB::GenBank();   # if you want NT seqs

    # use STDOUT to write sequences
    my $out = new Bio::SeqIO(-format => 'fasta');

    my $acc = 'AB077698';
    my $seq = $db->get_Seq_by_acc($acc);
    if( $seq ) {
        $out->write_seq($seq);
    } else {
        print STDERR "cannot find seq for acc $acc\n";
    }
    $out->close();

SEQUENCE RETRIEVAL FROM LOCAL DATABASE
    use Bio::DB::Flat;
    my $db = new Bio::DB::Flat(-directory  => '/tmp/idx',
                               -dbname     => 'swissprot',
                               -write_flag => 1,
                               -format     => 'fasta',
                               -index      => 'binarysearch');
    $db->make_index('/data/protein/swissprot');
    my $seq = $db->get_Seq_by_acc('BOSS_DROME');
