
Sequence alignment

In general, sequence alignment can be defined as a mapping between the residues of two or more sequences, that is,
finding the residues that are related to each other for phylogenetic or functional reasons. One of the main goals of
sequence alignment is to determine whether sequences display sufficient similarity to justify a conclusion about their
homology. However, similarity is not equivalent to homology. Similarity (of sequences) is a quantity that can be measured,
e.g. as percent identity. Homology means a common evolutionary history.

In principle, all alignment methods try to follow the molecular mechanisms of sequence evolution, namely substitutions,
deletions and insertions of monomers:

AGTCA             AGTCA          AGT-CA
  ↓                 ↓               ↓
AGACA             AG-CA          AGTACA
(substitution)    (deletion)     (insertion)

An alignment that spans the complete sequences is called a global alignment; an alignment of sequence fragments is called
a local alignment. Depending on the biological problem, either a global or a local alignment is the more effective way to
detect similarity between sequences. For instance, global alignments are advisable for relatively closely related
sequences. Local alignments are a powerful tool for locating the most important regions of similarity, e.g. functional
motifs, which may be flanked by variable regions that are less important for function. Local alignments should also be
applied in cases of mosaic arrangement of protein domains or for comparisons between spliced mRNAs and genomic sequences.
Alignment tasks can be subdivided into pairwise alignment (two sequences), multiple alignment (more than two) and
database similarity search (searching for the sequences in the database that are similar to a given sequence).

Pairwise alignment

For two sequences, many different alignments are possible (for instance, about 10^179 different alignments for two
sequences of length 300). An alignment with a relatively high number of matching residues and a relatively small number
of deletions/insertions is the most likely to have biological meaning, because it assumes a smaller number of
evolutionary events. The problem of finding such an alignment is equivalent to finding the “best” path in a dot matrix,
where the two sequences are plotted along the axes of a two-dimensional graph and dots indicate matching residues.
Qualitatively, the best global alignment is a path visiting as many dots as possible, with a preference for diagonal
moves over horizontal or vertical moves (because the latter represent deletions/insertions in the sequences). Dot
matrices are also very useful for visualizing alignments, especially after “filtering” out local regions of low
similarity, for instance by leaving only diagonals of consecutive matching residues.

            Sequence 2                          Sequence 2
         A G C T A G G A G                   A G C T A G G A G
       A .       .     .                   A .
       G   .       . .   .                 G   .
       C     .                             C     .
       G   .       . .   .                 G           .
Seq.1  G   .       . .   .          Seq.1  G             .
       A .       .     .                   A               .
       G   .       . .   .                 G             .   .
       A .       .     .                   A               .
       G   .       . .   .                 G                 .

(right panel: after filtering out similarities of fewer than 3 consecutive residues)

The example shown above suggests two alternative alignments:

AGCTAGGAG          AGCTAGGAG--    seq.2
|||   |||    or    |||  ||||
AGCGGAGAG          AGC--GGAGAG    seq.1

The second alignment has more matching residues than the first one, but it contains deletions. The selection of the best
alignment therefore depends on the scoring system for matches, mismatches and deletions/insertions. Thus, finding the
optimal alignment requires:
(1) a definition of scores for matches, mismatches and deletions/insertions;
(2) an algorithm to find the best score.

Substitution matrices

A simple match/mismatch scoring (such as 1 for a match, 0 for a mismatch) is not the most effective, especially for
proteins, where “conservative substitutions” of amino acids with similar properties are relatively frequent: for
instance, substitutions of functionally important polar residues by other polar ones, or mutual substitutions of
hydrophobic residues, e.g. Arg → Lys or Val → Ile. Therefore, the idea of a substitution matrix (Dayhoff et al., 1978)
was introduced. Such a matrix (dimensions 20×20, symmetric about the diagonal) defines scores for residue matches and for
all possible substitutions. The score values can be estimated from previously aligned sequences with trusted alignments.

An efficient way to introduce such scores is a log-odds calculation: a score proportional to the logarithm of the ratio
of the observed substitution frequency (target frequency) to the background frequency expected by chance. Thus, the score
for a substitution of two residues a and b is

$$ S(a,b) \sim \log \frac{p_{ab}^{\mathrm{observed}}}{p_{ab}^{\mathrm{background}}} = \log \frac{p_{ab}^{\mathrm{observed}}}{f_a f_b}, $$

where p_ab^observed is the probability (frequency) of the a/b substitution observed in the alignments used, and f_a and
f_b are the frequencies of occurrence of the two amino acids. The logarithm is used because adding such scores over the
positions of any new alignment is equivalent to estimating the likelihood of the hypothesis that the alignment reflects
sequence homology and has a non-random pattern (this likelihood is equal to the product of the likelihoods of non-random
alignment of all residue pairs, if the pairs are statistically independent).
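
The log-odds calculation can be sketched in a few lines of Python. This is only an illustration: the function name, the
scaling factor (scores in half-bit units, rounded to integers) and the frequencies used in the example are assumptions
made here, not values taken from the text.

```python
import math

def log_odds_score(p_ab_observed, f_a, f_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / background).

    The background frequency of the pair is taken as f_a * f_b,
    as in the formula above. 'scale' is an assumed scaling factor.
    """
    return round(scale * math.log2(p_ab_observed / (f_a * f_b)))

# Illustrative (hypothetical) frequencies:
# a pair observed more often than expected by chance gets a positive score,
print(log_odds_score(0.0114, 0.099, 0.068))
# a pair observed less often than expected gets a negative score.
print(log_odds_score(0.001, 0.05, 0.05))
```

Because the scores are logarithms, summing them along an alignment corresponds to multiplying the underlying
likelihood ratios.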

The first widely used substitution matrices were based on the point accepted mutation (PAM) evolutionary model. In
this model, a unit of protein divergence (1 PAM) was introduced, corresponding to a change of 1% of the positions of an
amino acid sequence. It is very important to note that a divergence of 100 PAM does not lead to a complete change of the
sequence, because some residues may change several times while others will not be substituted at all. Alignments of very
closely related sequences (trusted alignments in any scoring system) were used to estimate the target frequencies
corresponding to 1 PAM. These values can be extrapolated to more distant sequences, e.g. to PAM250 (the matrix
originally published by M. Dayhoff et al., 1978).
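
The effect of extrapolation can be illustrated with a toy model. The sketch below assumes a hypothetical 1-PAM
transition matrix in which every residue changes with probability 0.01 and all 19 replacements are equally likely;
the real PAM matrices use observed, unequal substitution frequencies. Extrapolation to n PAM is done here by
multiplying the 1-PAM matrix by itself n times.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def toy_pam1(n_states=20, change=0.01):
    """Hypothetical uniform 1-PAM matrix: 1% change per position."""
    off = change / (n_states - 1)
    return [[1.0 - change if i == j else off for j in range(n_states)]
            for i in range(n_states)]

def extrapolate(pam1, steps):
    """Transition matrix after 'steps' PAM units (repeated multiplication)."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = matmul(result, pam1)
    return result

def fraction_unchanged(matrix):
    """Average probability that a position shows its original residue."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)

pam1 = toy_pam1()
identity_100 = fraction_unchanged(extrapolate(pam1, 100))
identity_250 = fraction_unchanged(extrapolate(pam1, 250))
print(round(identity_100, 3), round(identity_250, 3))
```

Under this toy model roughly 38% of the positions still show the original residue after 100 PAM, which illustrates
why 100 PAM does not mean a completely changed sequence.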

A different strategy was used for the so-called BLOSUM matrices. They were derived from alignments contained in the
BLOCKS database of local multiple alignments. Here the frequencies were computed directly from the alignments. In
contrast to the PAM model, relatively distant sequences were considered, and very similar sequences were merged to avoid
statistical biases due to overrepresentation of some sequences. Thus, the BLOSUM62 matrix was computed from sequences
having at most 62% identity (sequences with higher identities were merged and counted as a single sequence). The
performance of BLOSUM62 turned out to be very good, and it is the standard matrix for many protein alignment programs.

The BLOSUM62 scoring matrix:

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V

For DNA, the 4×4 matrices are simpler: for instance, the BLASTN program uses a scoring system with a score of +2 for a
match and -1 for a mismatch. Sometimes the difference between the more frequent purine-purine or pyrimidine-pyrimidine
transitions (A↔G, C↔T) and transversions (A↔C, A↔T, G↔T etc.) is taken into account, for instance in the matrix used by
BLASTZ, a program for comparisons of large genomic sequences:

       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91
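
As a small illustration of how such a matrix is used, the following sketch scores an ungapped alignment of two DNA
strings of equal length with the BLASTZ values listed above. The function name and the test strings are made up for
this example.

```python
# Substitution scores copied from the BLASTZ matrix above.
BLASTZ = {
    "A": {"A": 91, "C": -114, "G": -31, "T": -123},
    "C": {"A": -114, "C": 100, "G": -125, "T": -31},
    "G": {"A": -31, "C": -125, "G": 100, "T": -114},
    "T": {"A": -123, "C": -31, "G": -114, "T": 91},
}

def ungapped_score(x, y, matrix=BLASTZ):
    """Sum of substitution scores over the columns of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment requires equal lengths")
    return sum(matrix[a][b] for a, b in zip(x, y))

print(ungapped_score("ACGT", "ACGT"))  # four matches
print(ungapped_score("AAAA", "GAAA"))  # one transition (A/G)
print(ungapped_score("AAAA", "CAAA"))  # one transversion (A/C)
```

The transition is penalised less than the transversion, which is the point of using such a matrix instead of a
uniform mismatch score.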

Gap penalties

Locations of insertions/deletions in alignments are called gaps. In an attempt to minimize the number of gaps, negative
scores (penalties) are introduced. The penalty values are subtracted from the scores calculated in the non-gapped
regions of alignments. The most common formula for a gap penalty is

S(gap) = G + L·n,

where G is the gap-opening penalty (also called gap existence cost), L is the gap-extension penalty (gap extension cost)
and n is the number of residues in the gap (the gap length). There is no solid theory here and the choice of values is
highly empirical. In combination with the BLOSUM62 substitution matrix, values of G around 10-15 and L around 1-2
turned out to perform rather well.
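
A minimal sketch of this penalty, with default values G = 11 and L = 1 chosen here only as an example from the ranges
mentioned above:

```python
def gap_penalty(n, gap_open=11, gap_extend=1):
    """Penalty G + L*n for a single gap of n residues (n >= 1)."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * n

# One gap of three residues costs less than three separate gaps of one residue.
print(gap_penalty(3), 3 * gap_penalty(1))
```

With a separate opening cost, the scoring favours a few long gaps over many short ones.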

Algorithms for finding the optimal alignment

The problem of finding the optimal alignment of two sequences (given a defined scoring system) has been solved using a
dynamic programming algorithm. The dynamic programming approach is used in various complex optimization problems
where the solution of a problem can be built recursively from the solutions of smaller subproblems. In the case of the
alignment problem, formulated as finding the “best” path in a matrix (using scores for matches/mismatches and gap
penalties), a very important observation is that any partial subpath of the optimal path is itself the optimal path for
the alignment of the two corresponding subsequences. The main recursive definition for the alignment of two
(sub)sequences of lengths m and n residues is as follows: the optimal score S(m,n) is the best value out of three
scores:

S(m-1, n-1) + (score of aligning residues m and n),
S(m-1, n) + (gap penalty),
S(m, n-1) + (gap penalty).

Using this definition, it is possible both to build a matrix of the best scores for all possible subalignments (called
the dynamic programming matrix) and to backtrack the optimal alignment for the full sequences. The dynamic programming
matrix is filled starting from the smallest subsequences, using a “bottom-up” approach. After completing the matrix, it
is possible to restore the sequence of calculations that has led to the final score.

An example (from S. Eddy, 2004):
two DNA sequences X = TTCATA and Y = TGCTCGTA. The scores: +5 for a match, -2 for a mismatch, -6 for each
insertion or deletion. The dynamic programming matrix is filled as follows:

            0    T    G    C    T    C    G    T    A   (Y)
   0    0   -6  -12  -18  -24  -30  -36  -42  -48   [alignment to “initialising” 0]
   T   -6    5   -1   -7  -13  -19  -25  -31  -37   [ S(1,1)=5 (T-T match); other values correspond to gaps ]
   T  -12   -1
X  C  -18   -7
   A  -24  -13
   T  -30  -19
   A  -36  -25

S(2,2) = max (5-2; -1-6; -1-6) = 3
S(2,3) = max (-1-2; 3-6; -7-6) = -3
S(2,4) = max (-7+5; -3-6; -13-6) = -2 etc.

S(3,3) = max (3+5; -3-6; -3-6) = 8
S(3,4) = max (-3-2; 8-6; -2-6) = 2
S(3,5) = max (-2+5; 2-6; -8-6) = 3 etc.


            0    T    G    C    T    C    G    T    A
   0    0   -6  -12  -18  -24  -30  -36  -42  -48
   T   -6    5   -1   -7  -13  -19  -25  -31  -37
   T  -12   -1    3   -3   -2   -8  -14  -20  -26
   C  -18   -7   -3    8    2    3
   A  -24  -13   -9
   T  -30  -19  -15
   A  -36  -25  -21


...
complete dynamic programming matrix:

            0    T    G    C    T    C    G    T    A
   0    0   -6  -12  -18  -24  -30  -36  -42  -48
   T   -6    5   -1   -7  -13  -19  -25  -31  -37
   T  -12   -1    3   -3   -2   -8  -14  -20  -26
   C  -18   -7   -3    8    2    3   -3   -9  -15
   A  -24  -13   -9    2    6    0    1   -5   -4
   T  -30  -19  -15   -4    7    4   -2    6    0
   A  -36  -25  -21  -10    1    5    2    0   11

The final best score S(6,8)=11 has been computed (backtracking) as follows:

S(6,8) ← S(5,7) ← S(4,6) ← S(3,5) ← S(2,4) ← S(1,3) ← S(1,2) ← S(1,1)
     (+5)     (+5)     (-2)     (+5)     (+5)     (-6)     (-6)
  11        6        1        3       -2       -7       -1        5

The backtracking allows one to reconstruct the optimal alignment:

T--TCATA sequence X
TGCTCGTA sequence Y
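
The worked example can be reproduced with a short Python sketch of the recursion and the backtracking. It assumes the
same simple scoring as in the example (+5 match, -2 mismatch, -6 per gapped position, no separate gap-opening cost);
the function name is chosen here for illustration.

```python
def needleman_wunsch(x, y, match=5, mismatch=-2, gap=-6):
    """Global alignment by dynamic programming; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    # S[i][j] = best score for aligning x[:i] with y[:j]
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # align x[i-1] with y[j-1]
                          S[i - 1][j] + gap,      # x[i-1] against a gap
                          S[i][j - 1] + gap)      # y[j-1] against a gap
    # Backtracking from S[m][n] to S[0][0]
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return S[m][n], "".join(reversed(ax)), "".join(reversed(ay))

score, aligned_x, aligned_y = needleman_wunsch("TTCATA", "TGCTCGTA")
print(score)
print(aligned_x)
print(aligned_y)
```

When several paths give the same score, this sketch prefers the diagonal move; other tie-breaking rules would report
a different but equally scoring alignment.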

Such an algorithm, given a scoring system, is guaranteed to find the optimal global alignment of two sequences. It is
often called the Needleman-Wunsch algorithm, after its authors (1970). An extension of this strategy, the
Smith-Waterman algorithm (1981), was developed to find the optimal local alignment. A locally optimal alignment is
defined as one that cannot be improved either by extending or by shortening the aligned regions. Both global and local
optimal alignments can be computed, for instance, with the tool align (www.ebi.ac.uk/Tools/emboss/align/), which offers
the algorithms needle and water.
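
A minimal sketch of the local variant follows. Compared with the global recursion it makes two changes: a cell score
is never allowed to fall below 0, and backtracking starts from the highest-scoring cell and stops at a cell with
score 0. The scoring values are the ones from the worked example above; the function name is illustrative.

```python
def smith_waterman(x, y, match=5, mismatch=-2, gap=-6):
    """Local alignment; returns (score, aligned_x_fragment, aligned_y_fragment)."""
    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtrack from the best cell until a zero score is reached
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

local_result = smith_waterman("TTCATA", "TGCTCGTA")
print(local_result)
```

For the two example sequences the best local alignment is the gap-free pairing of TCATA with TCGTA, which scores
higher than the global alignment because the costly gaps at the start are simply left out.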

Estimates of statistical significance for alignments

Given a scoring system and a (dynamic programming) algorithm for optimal alignment, any two sequences can be
aligned. However, there is no guarantee that the resulting alignment is a consequence of homology between the sequences.
Homology is likely only if the obtained alignment would be very unlikely to arise by chance alone.
Therefore, an estimate of statistical significance is very important.

For global alignments, there is no mathematical theory to estimate the score that can be expected by chance.
The significance of a global alignment can be estimated by comparison with the statistics of alignments of randomised
(permuted) sequences. For instance, to estimate the significance of the alignment of two sequences X and Y of
lengths m and n residues, respectively, one can generate 100 sequences of length m and 100 sequences of length n by
“reshuffling” the residues of X and Y. It is important that these sequences have the same lengths and residue
composition as the original ones, because the statistics depend on these parameters. Thus, 100 “random” alignments can
be produced and their scores compared with the score of the real alignment. Still, a quantitative estimate of
significance is not straightforward, because the distribution of scores of random alignments is not normal.
Qualitatively, it is possible, for instance, to estimate the probability of obtaining a given alignment score by chance
(P-value) as less than 0.01 if all 100 random alignments yield scores lower than that of the alignment of interest.
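
The reshuffling procedure can be sketched as follows. The scoring values, the fixed random seed and the way the
empirical P-value is computed, (hits + 1) / (trials + 1), are assumptions made for this illustration; with 100 trials
and no random score reaching the real one, this estimate is 1/101, i.e. just below 0.01.

```python
import random

def global_score(x, y, match=5, mismatch=-2, gap=-6):
    """Optimal global alignment score (score only, two rows of the matrix)."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffled(seq, rng):
    """Random permutation of seq: same length and residue composition."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def permutation_p_value(x, y, trials=100, seed=0):
    """Return (observed score, empirical P-value) from shuffled sequence pairs."""
    rng = random.Random(seed)
    observed = global_score(x, y)
    hits = sum(global_score(shuffled(x, rng), shuffled(y, rng)) >= observed
               for _ in range(trials))
    return observed, (hits + 1) / (trials + 1)

seq = "ACGTTGCAAGCTTAGCCATGGATC"  # made-up test sequence
print(permutation_p_value(seq, seq))
```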

For ungapped local alignments, the statistical estimates can be calculated directly. For two random sequences of
lengths m and n (sufficiently large), the probability of finding at least one gap-free alignment (called a high-scoring
segment pair, HSP) with a score of at least S (P-value) is equal to

P(S) = 1 - exp [ - Kmn exp ( - λS) ],

where K and λ depend on the scoring rules and residue frequencies. The formula is derived from the so-called extreme
value distribution, which is valid for the scores of random HSPs. The expected number of HSPs with a score of at least S
(E-value) is

E(S) = Kmn exp ( - λS).

For gapped alignments, a similar theory can be developed, but large-scale simulations with randomised sequences
are necessary to estimate the parameters used in the analytical formulas.
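
The two formulas translate directly into code. In the sketch below the values of K and λ, the sequence lengths and
the scores are placeholders chosen only to show the behaviour; real values of K and λ depend on the scoring system.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of HSPs with score >= S: K*m*n*exp(-lambda*S)."""
    return K * m * n * math.exp(-lam * score)

def p_value(score, m, n, K, lam):
    """Probability of at least one HSP with score >= S: 1 - exp(-E)."""
    return -math.expm1(-e_value(score, m, n, K, lam))

# Placeholder parameters: a 300-residue query against 1,000,000 residues.
K, lam = 0.13, 0.32
for s in (40, 60, 80):
    print(s, e_value(s, 300, 1_000_000, K, lam), p_value(s, 300, 1_000_000, K, lam))
```

For small E-values the P-value is practically equal to the E-value, which is why many programs report only E.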

• The optimal alignment of two sequences (in terms of scoring system) is not always the optimal one in terms of
biological significance. Frequently, the outputs of standard alignment programs require further manual improvement.
For instance, it can be done on the basis of known features in the sequences such as codons or structural motifs.

Multiple sequence alignment

Progressive multiple sequence alignment

Multiple alignment of many sequences is a significantly more complicated task than pairwise alignment. A direct
application of the exact dynamic programming procedure is computationally very demanding and is limited to relatively
small numbers of relatively short sequences. The commonly used approach is progressive multiple sequence
alignment, based on pairwise alignment of the most similar sequences with gradual addition of the more distant ones.
The most widely used implementation is the CLUSTAL series of programs.

The progressive alignment procedure is divided into three main parts:

1. All possible pairwise alignments between the sequences in a given set are computed. These pairwise alignments are
used to calculate a distance matrix (pairs of sequences with high similarity scores are assumed to be characterised by
short evolutionary distances).

2. A guide tree is calculated from the distance matrix. The branching order of the guide tree, similar to evolutionary trees,
follows the extent of relatedness of sequences.

3. The sequences are progressively aligned according to the tree branching order, starting from the closely related
sequences. At every step the pairwise alignment is used to align two (clusters of) sequences belonging to some node in
the tree. Alignment of sequence clusters differs from simple pairwise sequence alignment only by calculation of
substitution scores from many residues rather than from two (see also below). Within an already built cluster, the
positions of sequences are not changed, so that the gaps appearing during alignment to another cluster are introduced
simultaneously in all cluster members.

Various modifications are possible in every part of the algorithm, for instance different ways to convert similarity
scores to distances, clustering techniques to build a tree, approaches to construct consensus sequences, etc.

Guide tree. The first versions of CLUSTAL used the UPGMA procedure (unweighted pair group method with arithmetic
averages) to build the guide tree. This procedure is very simple from both conceptual and computational points of view.
In this method, the first cluster is formed from the most closely related sequences (say, seq1 and seq2). The distance
between this cluster and any other sequence seqX is calculated as the arithmetic average

d(seqX, new cluster) = [ d(seqX, seq1) + d(seqX, seq2) ] / 2.

Now a new distance matrix can be calculated and the closest sequences (or clusters) selected. The procedure is
repeated, so that at every step the distance between any two clusters is calculated as the arithmetic average over all
pairs of sequences from these clusters. In CLUSTAL W and subsequent packages, the neighbour-joining method (NJ) has
been implemented. This method estimates the divergence along each branch of the tree. It builds unrooted trees by a
“star-decomposition” procedure. At every step the decision to join two sequences (or clusters) is made using a modified
distance matrix that also takes into account the divergence of these two objects from all others. The root of the tree
is found by a “mid-point” method: finding a point where the means of branch lengths on the two sides of the root are
equal.
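
The UPGMA averaging rule can be sketched as follows. This is a deliberately simple version that recomputes every
cluster distance from the original pairwise distances and returns only the branching order as nested tuples (no
branch lengths); the sequence names and distances are made up for the example.

```python
from itertools import combinations

def upgma(names, distances):
    """Return the UPGMA branching order as nested tuples.

    distances maps a pair (a, b) of names to their distance; each pair
    needs to be given in one orientation only.
    """
    def d(a, b):
        return distances[(a, b)] if (a, b) in distances else distances[(b, a)]

    def average(c1, c2):
        # arithmetic average over all pairs of sequences from the two clusters
        return sum(d(a, b) for a in c1[0] for b in c2[0]) / (len(c1[0]) * len(c2[0]))

    # each cluster is (tuple of member names, subtree)
    clusters = [((name,), name) for name in names]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average(clusters[p[0]], clusters[p[1]]))
        c1, c2 = clusters[i], clusters[j]
        merged = (c1[0] + c2[0], (c1[1], c2[1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][1]

toy_distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
guide_tree = upgma(["A", "B", "C", "D"], toy_distances)
print(guide_tree)
```

In the toy data A and B are joined first, then C is added to that cluster, and D is attached last.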

Sequence weights were introduced in CLUSTAL W in order to compensate for the unequal representation of some
motifs in the dataset: for instance, the presence of a subgroup of very similar sequences while another subgroup has few
representatives. The weights up-weight the divergent sequences and down-weight those that are very similar to other
sequences. The weight of a sequence is calculated by adding the lengths of the branches in the tree between the
sequence and the root, each divided by the number of sequences sharing that branch, for instance, as shown
below:

                 +-- 0.08 -- seq.1   W(1)=0.205
       +-- 0.2 --+
       |         +-- 0.08 -- seq.2   W(2)=0.205
+-0.1--+
|      |         +-- 0.05 -- seq.3   W(3)=0.225
|      +-- 0.3 --+
|                +-- 0.06 -- seq.4   W(4)=0.235
|
+----------- 0.5 ----------- seq.5   W(5)=0.5

W(1) = 0.08 + (0.2/2) + (0.1/4) = 0.205
W(3) = 0.05 + (0.3/2) + (0.1/4) = 0.225
W(5) = 0.5
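
The weighting rule (each branch length divided by the number of sequences that share the branch, summed from the
root to the sequence) can be written as a short recursive function. The nested-list representation of the tree is
an assumption made for this sketch; the branch lengths are the ones from the example tree.

```python
def count_leaves(node):
    """Number of sequences below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for _, child in node)

def sequence_weights(node, inherited=0.0, weights=None):
    """Sum branch length / number of sequences sharing it, from root to leaf."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights[node] = inherited
        return weights
    for length, child in node:
        share = length / count_leaves(child)
        sequence_weights(child, inherited + share, weights)
    return weights

# Internal node = list of (branch_length, child); leaf = sequence name.
example_tree = [
    (0.1, [
        (0.2, [(0.08, "seq.1"), (0.08, "seq.2")]),
        (0.3, [(0.05, "seq.3"), (0.06, "seq.4")]),
    ]),
    (0.5, "seq.5"),
]
tree_weights = sequence_weights(example_tree)
for name in sorted(tree_weights):
    print(name, round(tree_weights[name], 3))
```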

The weighting is used for scoring the matching of residues in the alignment process. For a given position in
the alignment, the score is a weighted average of values M from a substitution matrix. If clusters of sequences
are considered, sequences from one cluster are compared to sequences from another.

For instance, assume aligning a cluster of 4 sequences and a cluster of 2. The score at any position of the
alignment is calculated as a weighted average of 4x2=8 matrix values:

seq.1   PEEKSAVTAL
seq.2   GEEKAAVLAL
seq.3   PADKTNVKAA
seq.4   AADKTNVKAA

seq.5   EGEWQLVLHV
seq.6   AAEKTKIRSA

(column 8 of the first cluster, residues T, L, K, K, is scored against column 7 of the second cluster, residues V, I)

Score = M(T,V)*W(1)*W(5) + M(T,I)*W(1)*W(6)
      + M(L,V)*W(2)*W(5) + M(L,I)*W(2)*W(6)
      + M(K,V)*W(3)*W(5) + M(K,I)*W(3)*W(6)
      + M(K,V)*W(4)*W(5) + M(K,I)*W(4)*W(6)
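
The calculation for this pair of columns can be sketched as below. The substitution values are the BLOSUM62 entries
for the residue pairs involved; the sequence weights in the second call are hypothetical, and dividing the weighted
sum by the number of pairs (8) is an assumption made here to obtain an average.

```python
# BLOSUM62 values for the residue pairs that occur in the example.
M = {("T", "V"): 0, ("T", "I"): -1,
     ("L", "V"): 1, ("L", "I"): 2,
     ("K", "V"): -2, ("K", "I"): -3}

def cluster_column_score(column1, weights1, column2, weights2, matrix=M):
    """Weighted average of matrix values over all pairs between two columns."""
    total = 0.0
    for a, wa in zip(column1, weights1):
        for b, wb in zip(column2, weights2):
            total += matrix[(a, b)] * wa * wb
    return total / (len(column1) * len(column2))

# Unit weights: plain average of the eight matrix values.
print(cluster_column_score("TLKK", [1, 1, 1, 1], "VI", [1, 1]))
# Hypothetical sequence weights W(1)..W(4) and W(5), W(6).
print(cluster_column_score("TLKK", [0.2, 0.2, 0.3, 0.3], "VI", [0.5, 0.4]))
```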

In order to optimize the performance of the multiple alignment procedure for sequences of various degrees of
similarity, flexible definitions of scoring are very useful, such as position-specific gap penalties and
variable substitution matrices. For instance, the gap penalty at some position may be decreased if the previous
alignment steps have already identified a gap at this position. The choice of the matrix may be, for instance,
BLOSUM80 for 80-100% similarity, BLOSUM62 for 60-80%, BLOSUM45 for 30-60% and BLOSUM30 for
less than 30%. The choice of a suitable matrix is made automatically at each step of the multiple alignment.

Multiple alignment quality may be estimated in terms of the sequence similarity derived from the alignment.
Different scoring systems, so-called objective functions, can be introduced. For instance, for each position
(column) of the alignment a quality score may be calculated, e.g. as a sum-of-pairs similarity score (according to
a substitution matrix), and the total alignment score is computed as the sum over all columns.
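
A sum-of-pairs score can be sketched as follows. For brevity the sketch replaces the substitution matrix by a
hypothetical scoring rule (+1 match, 0 mismatch, -1 residue against gap, 0 for two gaps); any matrix lookup could be
substituted in pair_score.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Hypothetical stand-in for a substitution matrix lookup."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment):
    """Total score: for every column, sum pair_score over all pairs of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(pair_score(a, b)
               for column in zip(*alignment)
               for a, b in combinations(column, 2))

print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```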

One of the serious problems in straightforward progressive multiple alignment is the local maximum
problem: due to the “greedy” nature of the alignment strategy, there is no guarantee that the globally optimal
solution will be found. Any misaligned region produced early in the alignment process cannot be corrected
later, when new information from other sequences is added. In particular, an incorrect branching order of the
initial tree may contribute significantly to the problem.

Introduction to Linux and basics of Linux systems


Linux is a computer operating system originally developed by Linus Torvalds as a research
project. There is some interesting history about the rapid evolution of Linux, but suffice it to
say that Linux has come a long way in a decade. Linux runs on Intel, Mac, Sun, DEC Alpha, and
several other hardware platforms.

Linux Features
• Linux is a full-featured, 32-bit multi-user/multi-tasking OS.
• Linux adheres to the common (POSIX) standards for UNIX.
• Native TCP/IP support.
• A mature X Window System GUI.
• Complete development environment: C, C++, Java, editors, version control systems.
• Open Source.

Multi-User Operation
In UNIX and Linux, all interactions with the OS are done through designated “users”, each of
whom has an ID (login name) and a password. UNIX allows different users to
work simultaneously and provides different levels of users.
The most powerful user is called the superuser or “root”, and has access to all files and processes.
The superuser performs many of the system management tasks, such as adding regular users, file
backups, system configuration etc.
Ordinary user accounts, which perform non-system tasks, have restricted access to
system-sensitive components to protect Linux from being accidentally or purposely damaged.
In a moment you will enter a user account and start exploring the Linux filesystem.

Why Linux?
Linux can operate as a web, file, smb (WinNT), Novell, printer, ftp, mail, SQL, masquerading,
firewall, and POP server, to name but a few.
It can also act as a development workstation for graphics, C, C++, Java, Perl, Python, SQL,
audio, video, documentation etc.

Fig: Linux Uses


Linux is a good solution for developers who need a stable and reliable platform with open
source code. It's not a good system for beginning developers who want a simple GUI interface
to a programming language, although Linux has many GUI software development interfaces.
Linux is also ideal as a workstation, and offers many customizable features not found on
other platforms. It makes a good platform for dedicated workstations that have limited functions,
as in an educational or laboratory environment.

The Role and Function of Linux
Application Platform: An operating system provides applications with a platform where they
can run, managing their access to the CPU and system memory.
Hardware Moderator: The operating system also serves as a mediator between running
applications and the system hardware. Most applications are not written to directly address a
computer’s hardware.
Security: The operating system is responsible for providing a degree of security for the data it
hosts.
Connectivity: The operating system manages connectivity between computer systems using a
variety of network media and interfaces, including infrared, Ethernet, and wireless.

Linux as a Server
File Server: Using the Network File System (NFS) or Samba service, Linux can be configured
to provide network storage of users’ files.
Print Server: Using the Common UNIX Printing System (CUPS) and Samba services together,
Linux can be configured to provide shared printing for network users.
Database Server: Linux works great as a database server. There are a variety of database
services available for Linux servers, including MySQL and PostgreSQL.
Web Server: Linux is also widely deployed as a Web server. The most popular Web service
currently used on Linux is the Apache Web server.
E-Mail Server: There are a variety of different e-mail services available for Linux that can turn
your system into an enterprise-class e-mail server.

Advantages of Linux as a Server
Linux is extremely stable. Simply put, a Linux server rarely crashes. It just keeps running and
running. Linux servers are very fast. Many benchmark tests have been run pitting Linux servers
against other server operating systems. Linux servers are much less expensive. Most other
server operating systems charge expensive per-seat licensing fees, making them very expensive
to deploy in large networks.

Linux Command-Line Interface
Linux shells: A shell is a command interpreter that allows you to type commands from the
keyboard to interact with the operating system kernel.
sh (Bourne Shell): The sh shell was one of the earliest shells, developed for UNIX back in the
late 1970s.
bash (Bourne-Again Shell): The bash shell is an improved version of the sh shell and is one
of the most popular shells today. It’s the default shell used by most Linux distributions.
csh (C Shell): The csh shell was originally developed for BSD UNIX. It uses a syntax that is
very similar to C programming.
tcsh: The tcsh shell is an improved version of the C Shell. It is the default shell used on
FreeBSD systems.
zsh (Z Shell): The Z Shell is an extended Bourne-style shell that adds many improvements,
including features found in bash and tcsh.

Commonly Used CLI Commands and Utilities
/ : denotes the root directory.
./ : denotes the current directory.
PATH : the environment variable that lists the directories the shell searches for commands.
halt: This command shuts down the operating system, but can only be run by the root user.
reboot: This command shuts down and restarts the operating system. It also can only be run
by root.
init 0: This command also shuts down the operating system, and can only be run by the root
user.
init 6: This command also shuts down and restarts the operating system. It also can only be
run by root.
man: This command displays the manual (help) pages for a command.
su (switch user): This command switches the current user to a new user account. This command
is most frequently used to switch to the superuser root account. In fact, if you don’t supply a
username, this utility assumes that you want to change to the root account. If you enter su -,
then you will switch to the root user account and have all of root’s environment variables
applied.
env: This command displays the environment variables for the currently logged-in user.
echo: This command is used to echo a line of text on the screen.
top: This command displays a list of all applications and processes
currently running on the system.
which: This command is used to display the full path to a shell command or utility.
whoami: This command displays the username of the currently logged-in user.
netstat: This command displays the status of the network, including current connections,
routing tables, etc.
route: This command is used to view or manipulate the system’s routing table.
ifconfig: This command is used to manage the network boards (interfaces) installed in the system.
It can be used to display or modify your network board configuration parameters.

Introduction to PERL and Operators in PERL


PERL

Perl is a family of two high-level, general-purpose, interpreted, dynamic programming
languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister
language", Perl 6, before the latter's name was officially changed to Raku in October 2019.
Though Perl is not officially an acronym, there are various backronyms in use, including
"Practical Extraction and Reporting Language". Perl was originally developed by Larry Wall
in 1987 as a general-purpose Unix scripting language to make report processing easier. Since
then, it has undergone many changes and revisions. Raku, which began as a redesign of Perl 5
in 2000, eventually evolved into a separate language. Both languages continue to be developed
independently by different development teams and liberally borrow ideas from one another.
Features

The overall structure of Perl derives broadly from C. Perl is procedural in nature, with variables,
expressions, assignment statements, brace-delimited blocks, control structures, and
subroutines.
Perl also takes features from shell programming. All variables are marked with leading sigils,
which allow variables to be interpolated directly into strings. However, unlike the shell, Perl
uses sigils on all accesses to variables, and unlike most other programming languages that use
sigils, the sigil doesn't denote the type of the variable but the type of the expression. So for
example, to access a list of values in a hash, the sigil for an array ("@") is used, not the sigil
for a hash ("%"). Perl also has many built-in functions that provide tools often used in shell
programming (although many of these tools are implemented by programs external to the shell)
such as sorting, and calling operating system facilities.
Perl takes lists from Lisp, hashes ("associative arrays") from AWK, and regular expressions
from sed. These simplify and facilitate many parsing, text-handling, and data-management
tasks. Also shared with Lisp are the implicit return of the last value in a block, and the fact that
all statements have a value, and thus are also expressions and can be used in larger expressions
themselves.
Perl 5 added features that support complex data structures, first-class functions (that is, closures
as values), and an object-oriented programming model. These
include references, packages, class-based method dispatch, and lexically scoped variables,
along with compiler directives (for example, the strict pragma). A major additional feature
introduced with Perl 5 was the ability to package code as reusable modules. Wall later stated
that "The whole intent of Perl 5's module system was to encourage the growth of Perl culture
rather than the Perl core."
All versions of Perl do automatic data-typing and automatic memory management. The
interpreter knows the type and storage requirements of every data object in the program; it
allocates and frees storage for them as necessary using reference counting (so it cannot
deallocate circular data structures without manual intervention). Legal type conversions — for
example, conversions from number to string — are done automatically at run time; illegal type
conversions are fatal errors.
Design

The design of Perl can be understood as a response to three broad trends in the computer
industry: falling hardware costs, rising labor costs, and improvements in
compiler technology. Many earlier computer languages, such as Fortran and C, aimed to make
efficient use of expensive computer hardware. In contrast, Perl was designed so that computer
programmers could write programs more quickly and easily.
Perl has many features that ease the task of the programmer at the expense of greater CPU and
memory requirements. These include automatic memory management; dynamic typing;
strings, lists, and hashes; regular expressions; introspection; and an eval() function. Perl follows
the theory of "no built-in limits", an idea similar to the Zero One Infinity rule.
Wall was trained as a linguist, and the design of Perl is very much informed by linguistic
principles. Examples include Huffman coding (common constructions should be short), good
end-weighting (the important information should come first), and a large collection of language
primitives. Perl favors language constructs that are concise and natural for humans to write,
even where they complicate the Perl interpreter.
Perl's syntax reflects the idea that "things that are different should look different." For example,
scalars, arrays, and hashes have different leading sigils. Array indices and hash keys use
different kinds of braces. Strings and regular expressions have different standard delimiters.
This approach can be contrasted with a language such as Lisp, where the same basic syntax,
composed of simple and universal symbolic expressions, is used for all purposes.
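A short sketch of "things that are different should look different" follows. The data values are made up for illustration; only the sigils, brackets, and delimiters matter here.

```perl
use strict;
use warnings;

my $name  = 'Larry';                                   # $ marks a scalar
my @tools = ('awk', 'sed', 'sh');                      # @ marks an array
my %info  = (creator => 'Larry Wall', year => 1987);   # % marks a hash

my $first = $tools[0];     # array index: square brackets
my $year  = $info{year};   # hash key: curly braces

my $text  = "Perl 5 and Perl 6";       # string: quote delimiters
my $count = () = $text =~ /Perl/g;     # regular expression: slash delimiters

print "$name: $first ($year), $count matches\n";
# prints "Larry: awk (1987), 2 matches"
```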

Perl does not enforce any particular programming paradigm (procedural, object-oriented,
functional, or others) or even require the programmer to choose among them.
There is a broad practical bent to both the Perl language and the community and culture that
surround it. The preface to Programming Perl begins: "Perl is a language for getting your job
done." One consequence of this is that Perl is not a tidy language. It includes many features,
tolerates exceptions to its rules, and employs heuristics to resolve syntactical ambiguities.
Because of the forgiving nature of the compiler, bugs can sometimes be hard to find. Perl's
function documentation remarks on the variant behavior of built-in functions in list and scalar
contexts by saying, "In general, they do what you want, unless you want consistency."
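The difference between list and scalar context can be sketched as follows. The array contents are arbitrary; reverse is used only as one example of a built-in whose result depends on context.

```perl
use strict;
use warnings;

my @tools = ('awk', 'sed', 'sh');

my @copy    = @tools;   # list context: all three elements
my $count   = @tools;   # scalar context: the number of elements (3)
my ($first) = @tools;   # list assignment: the first element ('awk')

# The same built-in behaves differently depending on context.
my @rev_list = reverse('ab', 'cd');   # list context: ('cd', 'ab')
my $rev_str  = reverse('ab', 'cd');   # scalar context: 'dcba'

print "$count $first @rev_list $rev_str\n";
# prints "3 awk cd ab dcba"
```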
No written specification or standard for the Perl language exists for Perl versions through Perl
5, and there are no plans to create one for the current version of Perl. There has been only one
implementation of the interpreter, and the language has evolved along with it. That interpreter,
together with its functional tests, stands as a de facto specification of the language. Perl 6,
however, started with a specification, and several projects aim to implement some or all of the
specification.
Applications

Perl has many and varied applications, compounded by the availability of many standard and
third-party modules.
Perl has chiefly been used to write CGI scripts. It is also an optional component of the popular
LAMP technology stack for Web development, in lieu of PHP or Python. Perl is used
extensively as a system programming language in the Debian GNU/Linux distribution.
Perl is often used as a glue language, tying together systems and interfaces that were not
specifically designed to interoperate, and for "data munging," that is, converting or processing
large amounts of data for tasks such as creating reports. In fact, these strengths are intimately
linked. The combination makes Perl a popular all-purpose language for system administrators,
particularly because short programs, often called "one-liner programs," can be entered and run
on a single command line.
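As an illustration of data munging, the sketch below turns a few log lines into a small report. The log lines and user names are hypothetical; real input would usually come from a file or from standard input.

```perl
use strict;
use warnings;

# Hypothetical input: one "user event" pair per line.
my @lines = (
    'alice login',
    'bob login',
    'alice logout',
);

# Count how many events each user produced.
my %events_per_user;
for my $line (@lines) {
    my ($user, $event) = split ' ', $line;
    $events_per_user{$user}++;
}

# Report the counts in alphabetical order of user name.
for my $user (sort keys %events_per_user) {
    printf "%-6s %d\n", $user, $events_per_user{$user};
}
```

The same split-count-print pattern is what many short administrative scripts amount to, which is why it compresses well into a single command line.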
Perl code can be made portable across Windows and Unix; such code is often used by suppliers
of software (both COTS and bespoke) to simplify packaging and maintenance of software
build and deployment scripts.

Graphical user interfaces (GUIs) may be developed using Perl. For example,
Perl/Tk and wxPerl are commonly used to enable user interaction with Perl scripts. Such
interaction may be synchronous or asynchronous, using callbacks to update the GUI.
Perl is a general-purpose programming language originally developed for text manipulation
and now used for a wide range of tasks including system administration, web development,
network programming, GUI development, and more.
What is Perl?
• Perl is often expanded as "Practical Extraction and Report Language", although it is not
officially an acronym.
• Perl is a stable, cross-platform programming language.
• It is used for mission-critical projects in the public and private sectors.
• Perl is open-source software, licensed under the Artistic License or the GNU General
Public License (GPL).
• Perl was created by Larry Wall.
• Perl 1.0 was released to Usenet's alt.comp.sources in 1987.
• At the time of writing this tutorial, the latest version of Perl was 5.16.2.
• Perl is listed in the Oxford English Dictionary.
• PC Magazine named Perl a finalist for its 1998 Technical Excellence Award in the
Development Tool category.
Although the first platform Perl inhabited was UNIX, it has since been ported to over
70 different operating systems including, but not limited to, Windows 9x/NT/2000,
MacOS, VMS, Linux, UNIX (many variants), BeOS, LynxOS, and QNX.

Uses of Perl
1. Tool for general system administration
2. Processing textual or numerical data
3. Database interconnectivity
4. Common Gateway Interface (CGI/Web) programming
5. Driving other programs (FTP, Mail, WWW, OLE)

Perl Features
• Perl takes the best features from other languages, such as C, awk, sed, sh, and BASIC,
among others.
• Perl's database integration interface, DBI, supports third-party databases including
Oracle, Sybase, Postgres, MySQL, and others.
• Perl works with HTML, XML, and other mark-up languages.
• Perl supports Unicode.
• Perl is Y2K compliant.
• Perl supports both procedural and object-oriented programming.
• Perl interfaces with external C/C++ libraries through XS or SWIG.

Language properties
• Perl is an interpreted language: program code is processed at run time. Unlike a purely
line-by-line interpreter, however, Perl compiles the whole program to an internal form
before it is actually executed.
• Many Perl idioms read like English.
• Free-format language: whitespace between tokens is optional.
• Comments are single-line, beginning with #.
• Statements end with a semicolon (;).
• Only subroutines and functions need to be explicitly declared.
• Blocks of statements are enclosed in curly braces {}.
• A script has no "main()".
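Several of these properties show up together in even a tiny script. The following is an illustrative sketch; the subroutine name shout and the word list are invented for the example.

```perl
use strict;
use warnings;

# Execution starts at the first statement in the file; there is no main().
my @words = qw(practical extraction report);

# Subroutines are declared with "sub"; their statements sit in a { } block.
sub shout {
    my ($word) = @_;
    return uc $word;
}

# Free format: this one statement spans three lines and ends at the semicolon.
my $banner = join ' ',
    map { shout($_) }
    @words;

# A statement modifier that reads much like English.
print "$banner\n" if @words;
# prints "PRACTICAL EXTRACTION REPORT"
```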
Basic Operators
Arithmetic

Example     Name             Result
$a + $b     Addition         Sum of $a and $b
$a * $b     Multiplication   Product of $a and $b
$a % $b     Modulus          Remainder of $a divided by $b
$a ** $b    Exponentiation   $a to the power of $b
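The arithmetic operators can be tried with a pair of small integers. The sketch uses $x and $y instead of $a and $b, because Perl reserves $a and $b as the comparison variables of sort; the values 7 and 3 are arbitrary.

```perl
use strict;
use warnings;

my ($x, $y) = (7, 3);

my $sum       = $x + $y;    # addition:       10
my $product   = $x * $y;    # multiplication: 21
my $remainder = $x % $y;    # modulus:        1
my $power     = $x ** $y;   # exponentiation: 343

print "$sum $product $remainder $power\n";
# prints "10 21 1 343"
```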

String

Example         Name            Result
$a . "string"   Concatenation   String built from pieces
"$a string"     Interpolation   String incorporating the value of $a
$a x $b         Repeat          String in which $a is repeated $b times
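A minimal sketch of the three string operators follows, with an arbitrary sample word. The single-quoted line is added only to contrast with interpolation, which happens in double-quoted strings.

```perl
use strict;
use warnings;

my $word = 'Perl';

my $joined  = $word . ' rocks';   # concatenation: "Perl rocks"
my $interp  = "$word rocks";      # interpolation: "Perl rocks"
my $literal = '$word rocks';      # single quotes: no interpolation
my $rule    = '-' x 10;           # repeat: "----------"

print "$joined\n$interp\n$literal\n$rule\n";
```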

Autoincrement and Autodecrement


The autoincrement and autodecrement operators are special cases of the assignment
operators, which add or subtract 1 from the value of a variable:
Example      Name            Result
++$a, $a++   Autoincrement   Add 1 to $a
--$a, $a--   Autodecrement   Subtract 1 from $a
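The prefix and postfix forms change the variable in the same way but return different values, which the following sketch makes visible. The variable names and the starting value 5 are arbitrary.

```perl
use strict;
use warnings;

my $n = 5;

my $pre  = ++$n;   # prefix: $n becomes 6, and 6 is returned
my $post = $n++;   # postfix: 6 is returned, then $n becomes 7
my $down = --$n;   # prefix decrement: $n becomes 6, and 6 is returned

print "$pre $post $down $n\n";
# prints "6 6 6 6"
```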

Logical

Conditions for truth:
• Any string is true except for "" and "0"
• Any number is true except for 0
• Any reference is true
• Any undefined value is false

Example      Name   Result
$a && $b     And    True if both $a and $b are true
$a || $b     Or     $a if $a is true; $b otherwise
!$a          Not    True if $a is not true
$a and $b    And    True if both $a and $b are true
$a or $b     Or     $a if $a is true; $b otherwise
not $a       Not    True if $a is not true
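The truth rules and the logical operators can be checked with a short script. This is a sketch with made-up values; lookup is a hypothetical helper. The word forms (and, or, not) behave like the symbolic ones but bind less tightly than assignment, which is why "or die" can follow a complete assignment.

```perl
use strict;
use warnings;

# Classify a few values according to the truth rules.
my @values = ('', '0', 0, undef, 'a', '00', 1, []);
my @labels = map { $_ ? 'true' : 'false' } @values;

# || returns its left operand if that is true, otherwise its right operand.
my $default = '' || 'fallback';        # 'fallback'
my $kept    = 'value' || 'fallback';   # 'value'

# && returns its right operand when the left operand is true.
my $both = 1 && 'second';              # 'second'

# Hypothetical lookup helper, used to show the low-precedence "or".
sub lookup {
    my ($key) = @_;
    my %table = (perl => 5);
    return $table{$key};
}

# Parsed as (my $found = lookup('perl')) or die ...; die is skipped here.
my $found = lookup('perl') or die "no such key\n";

print "@labels\n";
# prints "false false false false true true true true"
print "$default $kept $both $found\n";
# prints "fallback value second 5"
```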
