# STATISTICS AND PROBABILITY FOR BIOINFORMATICS

1. INTRODUCTION
2. EVENTS, PROBABILITY AND RULES
3. CLASSICAL PROBABILITY: EQUALLY LIKELY OUTCOMES
4. SUBJECTIVE PROBABILITIES
5. PROBABILITY RULES
6. (PROBABILISTIC) INDEPENDENCE
7. SEQUENCE ANALYSIS: PAIRWISE ALIGNMENT
8. SUBSTITUTION MATRICES
9. HIDDEN MARKOV MODELS

By KUMAR PARIJAT TRIPATHI

INTRODUCTION
An experiment is a situation involving chance or probability that leads to results called outcomes. An outcome is the result of a single trial of an experiment. An event is one or more outcomes of an experiment. Probability is the measure of how likely an event is.

Probability of an event:

P(A) = (number of ways event A can occur) / (total number of possible outcomes)

DETERMINISTIC AND RANDOM EXPERIMENT

Deterministic experiment: an experiment with only one possible result or outcome, i.e. whose result is certain or unique. The result is predictable with certainty and known before the experiment is conducted: the conditions under which the experiment is run determine its result.

Probabilistic experiment: an experiment whose result is uncertain, i.e. a random experiment. The experiment has two or more possible outcomes. The actual result will be one of the possible outcomes, but it cannot be predicted before the experiment is conducted. The different possible outcomes can be known or assessed, but it is not possible to predict which particular outcome will occur on any particular execution of the experiment.

EVENTS, PROBABILITY, RULES

Events (informal): a tossed coin comes up heads, a roll of two dice gives a double 6, etc. Events (formal): subsets of a sample space. Propositions: "it will rain tomorrow" or "India will win the match". All three are denoted by A, B, …, H, ….

"A or B" is written A ∪ B (or A ∨ B) and "A and B" is written A ∩ B (or A ∧ B), where ∪ means union and ∩ means intersection, while "not A" is written A'.

Basic expression: pr(A/H), where A and H are events, sets or propositions; pr = probability; "/" means "given", "conditional upon", "on the assumption of". Often H is the hypothesis under which the probability is evaluated.

CLASSICAL PROBABILITY: EQUALLY LIKELY OUTCOMES

Roll a fair die: pr(6 / even) = 1/3.

Roll two fair dice: pr(3 on first, 4 on second / sum = 7) = 1/6.

Pick a card at random from a pack: pr(ace / spade) = 1/13; pr(ace of spades) = 1/52.

Probabilities from frequencies: over a long period, approximately 51% of live births are male, so pr(next birth is male / H) ≈ 0.51.

E. coli has been sequenced. Of its 4.63 Mb, 1.14 Mb are adenine, 1.18 Mb are cytosine, 1.18 Mb are guanine, and 1.14 Mb are thymine. Choose a random base on one strand: pr(adenine / H) = 1.14/4.63 ≈ 0.25.
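The E. coli base-frequency calculation can be reproduced with a short sketch (base counts in Mb taken from the text; note they sum to 4.64 Mb, so pr(adenine) comes out just under 0.25):

```python
# Base counts (in Mb) for the E. coli genome, as quoted above.
base_counts = {"A": 1.14, "C": 1.18, "G": 1.18, "T": 1.14}
total = sum(base_counts.values())

# Probability of each base at a randomly chosen position on one strand.
probs = {base: count / total for base, count in base_counts.items()}
print(round(probs["A"], 3))  # 0.246, roughly 0.25
```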

SUBJECTIVE PROBABILITIES

A = "interest rates will be higher on Feb 1, 2001 than they are now"; pr(A/H) = something, for some people. Such judgements are made in the finance industry.

A bookie offers odds of 2:1 AGAINST horse A winning a race, so the bookie's pr(A will win / H) = 1/3. What is H?

A bookie offers odds of 2:1 ON horse A winning a race; then the bookie's pr(A will win / H) = 2/3.

PROBABILITY RULES (FOR ALL APPROACHES)

- 0 ≤ pr(A/H) ≤ 1
- If A1 = A2, then pr(A1/H) = pr(A2/H)
- If H1 = H2, then pr(A/H1) = pr(A/H2)
- If A is certain given H, pr(A/H) = 1
- If A is impossible given H, pr(A/H) = 0
- Addition rule: if pr(A & B/H) = 0, then pr(A or B/H) = pr(A/H) + pr(B/H)
- Multiplication rule: pr(A & B/H) = pr(A/H) × pr(B/A & H)
- Complement rule: pr(A'/H) = 1 − pr(A/H)

Proof of the complement rule: "A or A'" is certain given H and "A and A'" is impossible given H, so by the rules above (with B = A'): 1 = pr(A or A'/H) = pr(A/H) + pr(A'/H).

APPROPRIATENESS OF ADDITION RULE

Roll two fair dice and put A = "sum is even", B = "sum is 7". Then pr(A/H) = 18/36 and pr(B/H) = 6/36, and since A and B are mutually exclusive, pr(A or B/H) = (18 + 6)/36 = 2/3.

The proportion of Australian Aborigines aged 0–4 years in the 1971 census was 0.177; the proportion aged 5–9 years was 0.154. What is the chance that a randomly selected Australian Aborigine is aged under 10 years? Put A = "aged 0–4" and B = "aged 5–9"; then pr(aged under 10 / H) = pr(A or B/H) = 0.177 + 0.154 = 0.331.
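The two-dice example can be checked by enumerating all 36 equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

evens = [o for o in outcomes if sum(o) % 2 == 0]   # A: sum is even
sevens = [o for o in outcomes if sum(o) == 7]      # B: sum is 7

p_A = Fraction(len(evens), 36)    # 18/36
p_B = Fraction(len(sevens), 36)   # 6/36
# A and B are mutually exclusive (7 is odd), so the addition rule applies.
p_A_or_B = Fraction(len(set(evens) | set(sevens)), 36)
print(p_A_or_B == p_A + p_B, p_A_or_B)  # True 2/3
```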

EXAMPLES 1 AND 2

Example 1. Roll three dice. What is the chance of at least one ace (a 1)? There are 6 × 6 × 6 = 216 possible results on the three dice. How many involve at least one ace? Hard! How many involve no aces? 5 × 5 × 5 = 125. So pr(at least one ace) = 1 − pr(no aces) = 1 − (5/6)(5/6)(5/6).

Example 2. Pick a card at random from a well-shuffled pack, then pick a second card likewise, without replacing the first card. Let A = "1st card red", B = "2nd card red" and H = the usual assumptions. Then pr(A/H) = 26/52 = 1/2 and pr(B/A & H) = 25/51. What is pr(A & B/H)? By the multiplication rule, pr(A & B/H) = 26/52 × 25/51. Since B = (B & A) or (B & A'), the addition rule gives pr(B/H) = pr(B & A/H) + pr(B & A'/H) = 26/52 × 25/51 + 26/52 × 26/51 = 1/2.
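Both examples can be verified with exact arithmetic:

```python
from fractions import Fraction

# Example 1: at least one ace in three rolls, via the complement rule.
p_no_aces = Fraction(5, 6) ** 3          # 125/216
p_at_least_one_ace = 1 - p_no_aces
print(p_at_least_one_ace)  # 91/216

# Example 2: two cards drawn without replacement.
p_A_and_B = Fraction(26, 52) * Fraction(25, 51)  # both cards red
# Split B on the colour of the first card: B = (B & A) or (B & A').
p_B = Fraction(26, 52) * Fraction(25, 51) + Fraction(26, 52) * Fraction(26, 51)
print(p_B)  # 1/2
```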

EXAMPLE 3.

Of the 4,639,221 base pairs in the E. coli genome sequence, 1,142,136 are A's, and there are 255,179 occurrences of the dinucleotide AA. A position in the E. coli genome is chosen at random; let A = "an A at that position" and B = "an A at the next position". Then:

pr(A/H) = 1,142,136 / 4,639,221 ≈ 1/4
pr(B/A & H) = 255,179 / 1,142,136
pr(A & B/H) = 255,179 / 4,639,221 (by the multiplication rule)
pr(B/H) ≈ 1/4 (as above)
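The same arithmetic in code, checking that the multiplication rule ties the three probabilities together:

```python
# Counts from the E. coli genome as quoted above.
genome_len = 4_639_221
count_A = 1_142_136     # occurrences of A
count_AA = 255_179      # occurrences of the dinucleotide AA

p_A = count_A / genome_len          # pr(A/H), about 0.246
p_B_given_A = count_AA / count_A    # pr(B/A & H), about 0.223
p_A_and_B = count_AA / genome_len   # pr(A & B/H)

# Multiplication rule: pr(A & B/H) = pr(A/H) * pr(B/A & H).
print(abs(p_A * p_B_given_A - p_A_and_B) < 1e-12)  # True
```

Notice that pr(B/A & H) ≈ 0.223 is a little below pr(A/H) ≈ 0.246, so consecutive bases are not quite independent.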

EXTENDED ADDITION AND MULTIPLICATION RULES

Addition rule: if A1, A2, … are mutually exclusive given H, that is, "Ai and Aj" is impossible given H (more generally, pr(Ai & Aj/H) = 0 for i ≠ j), then

pr(A1 or A2 or …/H) = pr(A1/H) + pr(A2/H) + …

Multiplication rule:

pr(A1 & A2 & A3 & …/H) = pr(A1/H) × pr(A2/A1 & H) × pr(A3/A1 & A2 & H) × …

(PROBABILISTIC) INDEPENDENCE

Say B is (probabilistically) independent of A given H if

pr(B/A & H) = pr(B/H).  (*)

(*) implies pr(A & B/H) = pr(A/H) × pr(B/H), which is usually taken as the definition of independence. (*) also implies that A is independent of B given H, as long as pr(B/H) ≠ 0.

Proof: pr(A & B/H) = pr(A/H) × pr(B/A & H), and also pr(A & B/H) = pr(B/H) × pr(A/B & H); cancel pr(B/H) if it is ≠ 0.

Summary: as long as everything is ≠ 0, any one of

pr(A/B & H) = pr(A/H)
pr(B/A & H) = pr(B/H)
pr(A & B/H) = pr(A/H) × pr(B/H)

implies the other two.
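A minimal check of the product form of independence, using two fair coin tosses (where the second toss is independent of the first) enumerated exactly:

```python
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of two fair coin tosses.
outcomes = list(product("HT", repeat=2))

def pr(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

first_heads = lambda o: o[0] == "H"    # A
second_heads = lambda o: o[1] == "H"   # B
both = lambda o: o[0] == "H" and o[1] == "H"

# pr(A & B) = pr(A) * pr(B): the product form of independence holds.
print(pr(both) == pr(first_heads) * pr(second_heads))  # True
```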

SEQUENCE ANALYSIS: PAIRWISE ALIGNMENT

One of the basic problems a biologist is faced with when given two DNA or protein sequences is to determine whether they are related.

The goal of sequence alignment is to determine (a) the best alignment between the two sequences and (b) whether the two sequences show similarity by pure chance or due to common ancestry.

Alignment algorithms strive to model the mutational process giving rise to the two sequences. The basic mutational processes are:
1. Substitutions: replace a residue (DNA base or amino acid) with another.
2. Insertions: add residues to the sequence.
3. Deletions: remove residues from the sequence.
Insertions and deletions result in gaps in the alignment.

When calculating the total score of an alignment

X: x(1), x(2), x(3), x(4), …, x(n)
Y: y(1), y(2), y(3), y(4), …, y(n)

(where x(i) and y(i) are now either sequence residues or gaps), we assume independence between positions, so that the probability of the alignment is

Pr(alignment) = Pr(x(1), y(1)) × Pr(x(2), y(2)) × … × Pr(x(n), y(n)),

where Pr(x(i), y(i)) is the probability of aligning residue x(i) with y(i). Since alignments are usually long, the total probability of an alignment is very small, so it is common to use the logarithm of the probability as the score:

log Pr(alignment) = log Pr(x(1), y(1)) + log Pr(x(2), y(2)) + … + log Pr(x(n), y(n)),

giving a score with the additive property

S = s(x(1), y(1)) + s(x(2), y(2)) + … + s(x(n), y(n)).
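A sketch of this additive scoring, using a made-up table of pair probabilities (the values are illustrative assumptions, not from any real model):

```python
import math

# Hypothetical pair probabilities Pr(x(i), y(i)) for a few DNA pairs.
pair_prob = {("A", "A"): 0.08, ("C", "C"): 0.07,
             ("G", "G"): 0.08, ("T", "T"): 0.07,
             ("A", "G"): 0.02, ("C", "T"): 0.02}

def alignment_log_score(x, y):
    """Additive score: sum of log pair probabilities over positions."""
    return sum(math.log(pair_prob[(a, b)]) for a, b in zip(x, y))

# Summing logs avoids the numerical underflow that multiplying many
# small probabilities would cause for long alignments.
s = alignment_log_score("ACGT", "ACGT")
print(round(s, 2))  # -10.37
```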

SUBSTITUTION MATRICES

We can always produce an optimal alignment and an alignment score for two sequences, whether or not they really are related. But when is the score high enough to infer homology? One way to answer this is to compare the probability of the alignment when we assume homology with its probability when we assume the sequences are independent. Thus we have two models:

Match model (M), assuming homology: the residues x(i) and y(i) at position i in the alignment occur together with probability p(x(i), y(i)). Positions in the alignment are still assumed to be independent, but the sequences are assumed to be dependent (that is, p(x(i), y(i)) ≠ q(x(i)) × q(y(i))).

Random model (R), assuming no homology: both the sequences and the positions in the alignment are assumed to be independent, so at each position i the residues x(i) and y(i) occur with probability q(x(i)) × q(y(i)).

We score the alignment using the relative likelihood (odds ratio)

Pr(x, y/M) / Pr(x, y/R) = ∏ p(x(i), y(i)) / ∏ q(x(i)) q(y(i)).

Taking logarithms gives the additive log-odds score

S = Σ s(x(i), y(i)), with s(a, b) = log( p(a, b) / (q(a) q(b)) ),

which is the substitution matrix entry for the residue pair (a, b). The idea is as follows: if x and y really are homologous, aligned pairs follow the match model, so S tends to be large and positive. If S is "large enough" we reject the random model and assume the sequences to be homologous.
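A sketch of a log-odds substitution score, with illustrative (made-up) match-model and background probabilities:

```python
import math

# Hypothetical probabilities: p_match from the match model M,
# q from the random model R (uniform DNA background).
p_match = {("A", "A"): 0.12, ("A", "G"): 0.02}
q = {"A": 0.25, "G": 0.25}

def s(a, b):
    """Substitution score: log-odds of the pair under M versus R."""
    return math.log(p_match[(a, b)] / (q[a] * q[b]))

print(s("A", "A") > 0)  # True: A-A pairs are enriched under homology
print(s("A", "G") < 0)  # True: A-G pairs are rarer than chance
```

Real substitution matrices such as PAM and BLOSUM are built this way, with p and q estimated from alignments of known homologues.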

HIDDEN MARKOV MODELS

Markov chains: A Markov chain is a random process X(0), X(1), X(2), … which jumps randomly between the states of a state space S = {s(1), s(2), s(3), …}, with the memoryless property: which state the process jumps to next depends only on the current state, not on the past ones.

Which state the process begins in is determined by the initial distribution π, where π(i) = pr(X(0) = s(i)). The process then jumps between states according to the transition probabilities p(i, j) = pr(X(n+1) = s(j) / X(n) = s(i)).

Example
State space S = {a, b, c}; initial probabilities π = (0.3, 0.3, 0.4).

In the original figure, arrows between the states indicate the possible transitions, labelled with their transition probabilities. These transition probabilities can be organized in an array, the transition matrix.

Possible outcome sequence: X(0) = a, X(1) = a, X(2) = b, X(3) = c. Impossible outcome sequence: X(0) = b, X(1) = b, X(2) = a, X(3) = b (it would require a transition of probability 0).
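A simulation sketch of such a chain. The transition matrix here is an assumption (the original figure is not reproduced), chosen so that the jump b → a has probability 0, matching the impossible outcome above:

```python
import random

states = ["a", "b", "c"]
initial = [0.3, 0.3, 0.4]   # pi, as above
# Assumed transition matrix; each row sums to 1.
trans = {
    "a": {"a": 0.5, "b": 0.5, "c": 0.0},
    "b": {"a": 0.0, "b": 0.4, "c": 0.6},   # b -> a is impossible
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}

def simulate(n, rng=random):
    """Draw X(0) from the initial distribution, then jump n-1 times."""
    x = rng.choices(states, weights=initial)[0]
    path = [x]
    for _ in range(n - 1):
        x = rng.choices(states, weights=[trans[x][s] for s in states])[0]
        path.append(x)
    return path

print(simulate(4))  # a random path such as ['c', 'a', 'b', 'c']
```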

BIOLOGICAL MOTIVATION

MARKOV CHAIN MODEL


Hidden Markov Models
INTRODUCTION:

We have a Markov chain X(0), X(1), X(2), … taking values in a state space S = {s(1), s(2), s(3), …, s(N)} as before. However, instead of observing which states the chain jumps between, we observe something that is a function of (depends on) the states: an output sequence Y(0), Y(1), Y(2), …, where each Y(n) depends on the current state X(n).

EXAMPLE

Assume we have two dice A and B:

where die A generates numbers between 1 and 6 and die B generates numbers between 1 and 4. The process is as follows:
1. Randomly choose a die to start with, A or B.
2. Roll the die and record the number.
3. Choose whether to roll the current die again or to switch to the other.
4. Repeat steps 2-3.

To translate this into HMM formulation:
The state space is S ={A,B} . We randomly choose the first die X(0) according to the initial probabilities π = (π(A), π(B) ) Where Pr( X(0) = A) = π(A) The first observed number Y(0)= y appears with probability e(0) (y). We switch between states according to transition probabilities ·

In roll n, the state is X(n) and the observed number is Y(n).

Now assume that someone else rolled the dice, and we only know the underlying probabilities (initial distribution, transition probabilities, output distribution) and have only the observed output sequence.

This is a hidden Markov model: the hidden states are which die was used in each roll, and the output sequence is the numbers observed. HMM theory can help us answer questions like: What is the probability of observing such a series given our model? What is the most likely underlying sequence of dice (state sequence) giving rise to this output?
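As an illustration, a minimal Viterbi decoder answers the second question for the two-dice model. All initial and transition probabilities below are assumptions made for the sketch; the one fixed fact is that die B cannot show a 5 or 6:

```python
import math

def log(p):
    """Log probability, with log(0) treated as minus infinity."""
    return math.log(p) if p > 0 else float("-inf")

states = ["A", "B"]
init = {"A": 0.5, "B": 0.5}                     # assumed
trans = {"A": {"A": 0.9, "B": 0.1},             # assumed: dice are "sticky"
         "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {y: 1 / 6 for y in range(1, 7)},   # die A shows 1..6
        "B": {y: 1 / 4 for y in range(1, 5)}}   # die B shows 1..4

def viterbi(obs):
    """Most likely hidden sequence of dice given the observed numbers."""
    v = [{s: log(init[s]) + log(emit[s].get(obs[0], 0)) for s in states}]
    back = []
    for y in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: v[-1][r] + log(trans[r][s]))
            ptr[s] = prev
            row[s] = v[-1][prev] + log(trans[prev][s]) + log(emit[s].get(y, 0))
        v.append(row)
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi([6, 6, 2, 1, 3]))  # ['A', 'A', 'A', 'A', 'A']: a 6 rules out die B
```

Working in log space avoids underflow, exactly as in the alignment scoring above.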

PROFILE HMM FOR SEQUENCE ALIGNMENT
