
A Few Basics in Probability, Random Variables, Matrices and Optimization


Probability and random variables
• Signals: classified as deterministic or random
• Random: unpredictable; one cannot predict the value at time t with certainty

• Consider x(t) = A cos(2πf₁t + θ). If A, f₁ and θ are known, x(t) is known for all t, so the signal is deterministic
• If x(t) is generated from an oscillator with poor frequency stability, the output is random

• Another example: tossing a coin (the outcomes are uncertain)
• There are mathematical tools available for the analysis and characterization of random phenomena
• To do this we need the concepts of probability, random variables and random processes (the last not discussed here)
Probability Concepts
• Some definitions:
Random experiment: An experiment whose outcomes are not known in advance
Outcome: The result of an experiment
Event: An outcome or a set of outcomes
Sample space: The set of all possible outcomes
Sample point: An outcome of an experiment
ME or disjoint events: Events A and B are mutually exclusive (ME) if they cannot occur together
• Null event: An event with no sample points. If A and B are ME then AB = ∅ (the null event)
• Certain event: Consists of all outcomes

• Definitions of probability:
1. Relative frequency
2. Classical
• Relative frequency definition:
n = number of times the random experiment is repeated
n_A = number of times event A occurs
$P(A) = \lim_{n \to \infty} \frac{n_A}{n}$
For small n, P(A) fluctuates, but it tends to a limiting value for large n. This is the empirical approach.
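A minimal Python sketch (not from the slides; the fair-coin setup and trial counts are arbitrary choices) illustrating the relative-frequency definition:

```python
import random

def relative_frequency(n):
    """Estimate P(heads) for a fair coin from n repetitions."""
    n_A = sum(random.random() < 0.5 for _ in range(n))  # times event A (heads) occurs
    return n_A / n

for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency(n))  # fluctuates for small n, tends to 0.5
```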
• Classical definition:
No experiment is required.
P(A) = N_A / N
N = number of possible outcomes
N_A = number of outcomes favourable to A
Assumption: the outcomes are equally likely!
• Axioms of probability (statements accepted without proof): properties that probability has to satisfy
• To each event A we assign a real number P(A) that should satisfy:
1. P(A) ≥ 0
2. P(S) = 1
3. If A ⊂ S, B ⊂ S and AB = ∅, then P(A + B) = P(A) + P(B)
• Suppose A and B are not mutually exclusive, then?
P(A + B) = P(A) + P(B) - P(AB)
• If A, B and C are not ME:
P(A + B + C) = P(A) + P(B) + P(C) - P(AB) - P(BC) - P(AC) + P(ABC)
Conditional Probability
P(A|B) = P(AB) / P(B)
P(B|A) = P(AB) / P(A)
P(A|B) = P(B|A) P(A) / P(B), known as Bayes' theorem

• Bayes' theorem is used extensively while solving problems in the ML area; a small numerical sketch follows below.
• ML (maximum likelihood) and MAP (maximum a posteriori) estimation are based on it.
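A small worked sketch of the theorem; the diagnostic-test numbers here are hypothetical, chosen only for illustration:

```python
# Hypothetical diagnostic-test numbers, purely illustrative.
p_d = 0.01                    # prior P(D): having the condition
p_pos_given_d = 0.95          # likelihood P(+ | D)
p_pos_given_nd = 0.05         # false-positive rate P(+ | not D)

# Total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_nd * (1 - p_d)

# Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+)
print(p_pos_given_d * p_d / p_pos)   # ≈ 0.161, the posterior
```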


Independent events
• If A and B are independent,
P(A|B) = P(A)
P(B|A) = P(B)
That makes P(AB) = P(A)P(B)
• If A, B and C are independent, we should have
P(AB) = P(A)P(B)
P(BC) = P(B)P(C)
P(CA) = P(C)P(A), and P(ABC) = P(A)P(B)P(C)
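A Monte-Carlo sketch (the two dice events are an illustrative choice) checking P(AB) = P(A)P(B) for independent events:

```python
import random

trials = 100_000
n_A = n_B = n_AB = 0
for _ in range(trials):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)  # two independent dice
    A = d1 == 6          # event A: first die shows a 6
    B = d2 % 2 == 0      # event B: second die is even
    n_A += A
    n_B += B
    n_AB += A and B

print(n_AB / trials, (n_A / trials) * (n_B / trials))    # both ≈ 1/12
```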
Random variable
• A random variable (RV) is a function whose domain is the sample space of a random experiment and whose range is the real line
• X : S → R
• So an RV assigns a number X(f) to every outcome f of the experiment
• Let the experiment be that of throwing a die
• The sample space consists of the 6 faces of the die, i.e., 6 sample points
• Let the transformation be X(fᵢ) = 10i
Then the random variable denoted by X takes 6 values: 10, 20, …, 60 (see the sketch below).
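A tiny Python sketch (illustrative, not from the slides) of this RV as a mapping from sample points to real numbers:

```python
import random

# Die-throw experiment: sample space S = {1, ..., 6}; RV X(f_i) = 10*i.
def X(face):
    return 10 * face

face = random.randint(1, 6)   # one outcome of the random experiment
print(face, X(face))          # X takes one of the 6 values 10, 20, ..., 60
```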
Distribution and density functions
• One can study the behaviour of a random variable by using the CDF and PDF
• Cumulative distribution function: F_X(x) = P(X ≤ x)
• F_X(·) is a function whose domain is the real line and whose range is [0, 1]
• A density function is obtained by differentiating the CDF: f_X(x) = dF_X(x)/dx
• Some of the distributions used in practice:
• Discrete: Binomial, Poisson, Geometric, etc.
• Continuous: Uniform, Normal or Gaussian, Gamma, Beta, Cauchy, Rayleigh, etc.
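A short numpy sketch (illustrative only) of the CDF definition F_X(x) = P(X ≤ x), estimated empirically for a standard Gaussian sample:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)   # draws of a standard Gaussian RV X

# Empirical CDF: F_X(x) = P(X <= x), estimated as a relative frequency.
for x in (-1.0, 0.0, 1.0):
    print(x, np.mean(samples <= x))      # ≈ 0.159, 0.500, 0.841
```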
Gaussian random variable
• For a single random variable, the univariate pdf is given by
$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$
• When we have multiple random variables (n of them), the multivariate pdf is given by
$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |C_X|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{m})^T C_X^{-1} (\mathbf{x} - \mathbf{m})\right)$
• Covariance matrix: for N RVs the matrix has size N×N (note: the previous slide used n)

$C_X = \begin{bmatrix}
\mathrm{Cov}(X_1, X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_N) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Cov}(X_2, X_2) & \cdots & \mathrm{Cov}(X_2, X_N) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_N, X_1) & \mathrm{Cov}(X_N, X_2) & \cdots & \mathrm{Cov}(X_N, X_N)
\end{bmatrix}$

$\mathrm{Cov}(X_1, X_1) = \mathrm{Var}(X_1) = E(X_1 - m_{X_1})^2$
$\mathrm{Cov}(X_1, X_2) = E[(X_1 - m_{X_1})(X_2 - m_{X_2})]$
• E represents expectation or mean: E(X) = Σᵢ xᵢ P(X = xᵢ)
• If two RVs are uncorrelated, their covariance is 0. If they are independent they are uncorrelated, but not the other way around.
• Independence requires the joint density function to equal the product of the marginal densities (see the sketch below).
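A minimal numpy sketch (the data and sizes are arbitrary, illustrative choices) computing a sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10_000))   # 3 RVs, 10000 samples of each
X[1] += 0.5 * X[0]                     # make the first two RVs correlated

C = np.cov(X)                          # 3x3 sample covariance matrix
print(C)                               # diagonal entries ≈ the variances
```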
• Properties of the covariance matrix / correlation matrix (for the correlation matrix, consider non-mean-subtracted RVs):
Symmetric: C_X = C_Xᵀ
Positive semidefinite: aᵀ C_X a ≥ 0 for every vector a
Eigenvalues are greater than or equal to zero
Eigenvectors are orthogonal
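These properties can be checked numerically; a small self-contained sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
C = np.cov(rng.standard_normal((4, 100)))  # some 4x4 sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)       # eigh is meant for symmetric matrices
print(np.allclose(C, C.T))                 # True: symmetric
print(np.all(eigvals >= -1e-12))           # True: eigenvalues >= 0 (up to round-off)
print(np.allclose(eigvecs.T @ eigvecs, np.eye(4)))  # True: orthonormal eigenvectors
```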
Some points on eigenvalues and eigenvectors
• Consider a square matrix A. Eigenvectors x are those which satisfy T[x] = (a scalar) × x
• i.e., Ax = λx; x is the eigenvector corresponding to the eigenvalue λ
• How do we find λ? Make |A - λI| = 0. Why? (A nonzero x solving (A - λI)x = 0 exists only if A - λI is singular.)
• Eigenvalues of a covariance matrix are always non-negative. Why?
• The question then is: does the inverse of a covariance matrix always exist?
First check: what is the condition for the existence of the inverse of a matrix?
Is the covariance matrix diagonalizable?
Matrix Diagonalization
• One can represent a square matrix as:
A = U Λ U⁻¹, where A, U and Λ are all N×N
This is called diagonalization of the matrix A.
We have N eigenvalues and N eigenvectors, each of size N×1, so U is the eigenvector matrix (eigenvectors as its columns).
Here Λ is a diagonal matrix with the eigenvalues as its diagonal entries.
Can any square matrix be decomposed as above?
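A minimal numpy sketch of the decomposition (the matrix is an illustrative choice; being symmetric, it is guaranteed diagonalizable, which is not true of every square matrix):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # a symmetric, hence diagonalizable, matrix

eigvals, U = np.linalg.eig(A)       # columns of U are the eigenvectors
Lam = np.diag(eigvals)              # Λ: eigenvalues on the diagonal

print(np.allclose(A, U @ Lam @ np.linalg.inv(U)))  # True: A = U Λ U^{-1}
```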
Central Limit Theorem
• To understand this we need to know IID RVs.
• IID: independent and identically distributed
To state the theorem: consider n IID random variables X₁, …, X_n, each with mean μ and variance σ².
Take their sum, i.e., S_n = X₁ + X₂ + … + X_n
• Now the RV S_n has mean nμ and variance nσ²
• Form the normalized RV
Z_n = (S_n - nμ) / (σ√n)
As n → ∞, this random variable has a Gaussian distribution with zero mean and variance 1.
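A quick Monte-Carlo sketch of the theorem (uniform IID RVs and the values of n and the trial count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20_000
mu, sigma = 0.5, np.sqrt(1 / 12)         # mean and std of a Uniform(0, 1) RV

S = rng.random((trials, n)).sum(axis=1)  # S_n, realized over many trials
Z = (S - n * mu) / (sigma * np.sqrt(n))  # normalized RV Z_n

print(Z.mean(), Z.var())                 # ≈ 0 and ≈ 1; Z_n is close to N(0, 1)
```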
Convex Optimization
(You may refer to the book/YouTube lectures by Stephen Boyd)
• In ML we often use optimization techniques to find the required parameters/weights of a function relating the input and output
• Given a function f of the parameters, we minimize or maximize it to search for the optimum parameter values
• Minimization/maximization can be done using optimization techniques.
• Optimization is necessary (to find optimal parameters) in many ML techniques: linear regression, logistic regression and SVM, to name a few.
• In general, finding a global optimum of a function with a number of unknowns (as parameters) is difficult and computationally taxing. This is especially true for DNNs (we end up with a non-convex function).
• But for the class of problems in ML involving convex functions, one can find the global solution, and with low computational complexity at that.
Convex Sets
• A set C is a convex set if for any two points x and y belonging to C and any θ ∈ R with 0 ≤ θ ≤ 1, the convex combination also belongs to C:
θx + (1 - θ)y ∈ C
• Intuitively: if we take any two points in C and draw the line segment between them, every point on that segment also belongs to C
Examples
• All of Rⁿ
• The norm ball: the set {x : ||x|| ≤ 1}
• Affine subspace: the set of all x ∈ Rⁿ such that Ax = b
• The intersection of convex sets is also convex
• The set of positive semidefinite matrices
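A tiny numpy sketch checking the defining property on one of these sets, the unit ball (the sample points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def in_unit_ball(p):
    return np.linalg.norm(p) <= 1 + 1e-12

# Two arbitrary points inside the unit ball {x : ||x|| <= 1}.
x = rng.standard_normal(3)
x /= 2 * np.linalg.norm(x)                 # now ||x|| = 1/2
y = rng.standard_normal(3)
y /= 3 * np.linalg.norm(y)                 # now ||y|| = 1/3

# Every convex combination theta*x + (1 - theta)*y must stay in the ball.
print(all(in_unit_ball(t * x + (1 - t) * y) for t in np.linspace(0, 1, 11)))  # True
```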
Convex function
• A function f : Rⁿ → R is convex if its domain, say D(f), is a convex set and if for all x, y ∈ D(f) and θ ∈ R, 0 ≤ θ ≤ 1,
f(θx + (1 - θ)y) ≤ θf(x) + (1 - θ)f(y)
Intuitively, the value of the function at the combined point is less than or equal to the corresponding value on the straight line (the chord) joining f(x) and f(y).
• We say the function is strictly convex if equality is not allowed.
• If f is convex, -f will be concave.
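A quick numeric check of this inequality for the convex function f(x) = x² (the random test points are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2                 # a convex function of one variable

ok = True
for _ in range(1000):
    x, y, th = rng.normal(), rng.normal(), rng.random()
    # f(th*x + (1-th)*y) <= th*f(x) + (1-th)*f(y) must hold for convex f.
    ok &= f(th * x + (1 - th) * y) <= th * f(x) + (1 - th) * f(y) + 1e-12
print(ok)  # True
```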


First order condition for convexity
• Draw a tangent at any point on the graph. Every point on this tangent line should lie below (or on) the corresponding point of the function, i.e., f(y) ≥ f(x) + ∇f(x)ᵀ(y - x) for all x, y in the domain
Second order condition
• The Hessian must be positive semidefinite. For example, take
f(x₁, x₂) = x₁² + x₂²
$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$
• For positive semidefiniteness we want
$\begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \ge 0$ for all a₁, a₂
• If the Hessian is negative semidefinite, the function is concave.
• For a function of one variable, the condition is that the second derivative is greater than or equal to zero (a numeric sketch follows below).
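A small numpy verification of the positive-semidefiniteness check above (the random test vectors are an illustrative choice):

```python
import numpy as np

# Hessian of f(x1, x2) = x1^2 + x2^2.
H = np.array([[2.0, 0.0],
              [0.0, 2.0]])

rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 2))        # random test vectors a = [a1, a2]
quad = np.einsum('ij,jk,ik->i', a, H, a)  # a^T H a for each test vector
print(np.all(quad >= 0))                  # True: H is positive semidefinite
print(np.linalg.eigvalsh(H))              # [2. 2.]: eigenvalues >= 0
```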
Examples of convex functions
• Exponential function f(x) = e^(ax): the second derivative a²e^(ax) is positive for all x, so convex
• f(x) = -log(x): the second derivative is 1/x² > 0 on the domain x > 0, so convex
• Affine functions f(x) = bᵀx + c: the Hessian is 0, so it is both positive semidefinite and negative semidefinite. Hence affine functions are concave as well as convex.
• Quadratic functions f(x) = ½ xᵀAx + bᵀx + c: the Hessian is A, so convexity/non-convexity is determined by the positive semidefiniteness of the Hessian.
