
A Few Basics in Probability, Random Variables, Matrices and Optimization


Probability and random variables
• Signals: classified as deterministic or random
• Random: unpredictable; one cannot predict the value at time t with certainty

• Consider x(t) = A cos(2πf₁t + θ). If A, f₁ and θ are known, x(t) is known for all t, so the signal is deterministic
• If x(t) is generated from an oscillator with poor frequency stability, the output is random

• Another example: tossing a coin (the outcomes are uncertain)
• There are mathematical tools available for the analysis and characterization of random phenomena
• To do this we need the concepts of probability, random variables and random processes (the last not discussed here)
Probability Concepts
• Some definitions:
Random experiment: An experiment whose outcomes are not known in advance
Outcome: The result of an experiment
Event: An outcome or a set of outcomes
Sample space: The set of all possible outcomes
Sample point: An outcome of an experiment
ME or disjoint events: Events A and B are mutually exclusive (ME) if they cannot occur together
• Null event: An event with no sample points. If A and B are ME then AB = ∅ (the null event)
• Certain event: Consists of all outcomes

• Definitions of probability:
1. Relative frequency
2. Classical
• Relative frequency definition:
n = number of times the random experiment is repeated
n_A = number of times event A occurs
$P(A) = \lim_{n \to \infty} \frac{n_A}{n}$
For small n, P(A) fluctuates, but it tends to a limiting value for large n. This is the empirical approach.
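A minimal Python sketch (not from the slides; the fair-coin setup and trial counts are arbitrary choices) illustrating the relative-frequency definition:

```python
import random

def relative_frequency(n):
    """Estimate P(heads) for a fair coin from n repetitions."""
    n_A = sum(random.random() < 0.5 for _ in range(n))  # times event A (heads) occurs
    return n_A / n

for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency(n))  # fluctuates for small n, tends to 0.5
```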
• Classical definition:
No experiment is required.
P(A) = N_A / N
N = number of possible outcomes
N_A = number of outcomes favourable to A
Assumption: the outcomes are equally likely!
• Axioms of probability (statements accepted without proof): properties that probability has to satisfy
• To each event A we assign a real number P(A) that should satisfy:
1. P(A) ≥ 0
2. P(S) = 1
3. If A ⊂ S, B ⊂ S and AB = ∅, then P(A + B) = P(A) + P(B)
• Suppose A and B are not mutually exclusive, then?
P(A + B) = P(A) + P(B) - P(AB)
• If A, B and C are not ME:
P(A + B + C) = P(A) + P(B) + P(C) - P(AB) - P(BC) - P(AC) + P(ABC)
Conditional Probability
P(A|B) = P(AB) / P(B)
P(B|A) = P(AB) / P(A)
P(A|B) = P(B|A) P(A) / P(B), known as Bayes' theorem

• Bayes' theorem is used extensively while solving problems in the ML area; a small numerical sketch follows below.
• ML (maximum likelihood) and MAP (maximum a posteriori) estimation are based on it.
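A small worked sketch of the theorem; the diagnostic-test numbers here are hypothetical, chosen only for illustration:

```python
# Hypothetical diagnostic-test numbers, purely illustrative.
p_d = 0.01                    # prior P(D): having the condition
p_pos_given_d = 0.95          # likelihood P(+ | D)
p_pos_given_nd = 0.05         # false-positive rate P(+ | not D)

# Total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_nd * (1 - p_d)

# Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+)
print(p_pos_given_d * p_d / p_pos)   # ≈ 0.161, the posterior
```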


Independent events
• If A and B are independent,
P(A|B) = P(A)
P(B|A) = P(B)
That makes P(AB) = P(A)P(B)
• If A, B and C are independent, we should have
P(AB) = P(A)P(B)
P(BC) = P(B)P(C)
P(CA) = P(C)P(A), and P(ABC) = P(A)P(B)P(C)
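A Monte-Carlo sketch (the two dice events are an illustrative choice) checking P(AB) = P(A)P(B) for independent events:

```python
import random

trials = 100_000
n_A = n_B = n_AB = 0
for _ in range(trials):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)  # two independent dice
    A = d1 == 6          # event A: first die shows a 6
    B = d2 % 2 == 0      # event B: second die is even
    n_A += A
    n_B += B
    n_AB += A and B

print(n_AB / trials, (n_A / trials) * (n_B / trials))    # both ≈ 1/12
```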
Random variable
• A random variable (RV) is a function whose domain is the sample space of a random experiment and whose range is the real line
• X : S → R
• So an RV assigns a number X(f) to every outcome f of the experiment
• Let the experiment be that of throwing a die
• The sample space consists of the 6 faces of the die, i.e., 6 sample points
• Let the transformation be X(fᵢ) = 10i
Then the random variable denoted by X takes 6 values: 10, 20, …, 60 (see the sketch below).
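A tiny Python sketch (illustrative, not from the slides) of this RV as a mapping from sample points to real numbers:

```python
import random

# Die-throw experiment: sample space S = {1, ..., 6}; RV X(f_i) = 10*i.
def X(face):
    return 10 * face

face = random.randint(1, 6)   # one outcome of the random experiment
print(face, X(face))          # X takes one of the 6 values 10, 20, ..., 60
```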
Distribution and density functions
• One can study the behaviour of a random variable by using the CDF and PDF
• Cumulative distribution function: F_X(x) = P(X ≤ x)
• F_X(·) is a function whose domain is the real line and whose range is [0, 1]
• A density function is obtained by differentiating the CDF: f_X(x) = dF_X(x)/dx
• Some of the distributions used in practice:
• Discrete: Binomial, Poisson, Geometric, etc.
• Continuous: Uniform, Normal or Gaussian, Gamma, Beta, Cauchy, Rayleigh, etc.
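A short numpy sketch (illustrative only) of the CDF definition F_X(x) = P(X ≤ x), estimated empirically for a standard Gaussian sample:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)   # draws of a standard Gaussian RV X

# Empirical CDF: F_X(x) = P(X <= x), estimated as a relative frequency.
for x in (-1.0, 0.0, 1.0):
    print(x, np.mean(samples <= x))      # ≈ 0.159, 0.500, 0.841
```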
Gaussian random variable
• For a single random variable, the univariate pdf is given by
$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$
• When we have multiple random variables (n of them), the multivariate pdf is given by
$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |C_X|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{m})^T C_X^{-1} (\mathbf{x} - \mathbf{m})\right)$
• Covariance matrix: for N RVs the matrix has size N×N (note: the previous slide used n)

$C_X = \begin{bmatrix}
\mathrm{Cov}(X_1, X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_N) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Cov}(X_2, X_2) & \cdots & \mathrm{Cov}(X_2, X_N) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_N, X_1) & \mathrm{Cov}(X_N, X_2) & \cdots & \mathrm{Cov}(X_N, X_N)
\end{bmatrix}$

$\mathrm{Cov}(X_1, X_1) = \mathrm{Var}(X_1) = E(X_1 - m_{X_1})^2$
$\mathrm{Cov}(X_1, X_2) = E[(X_1 - m_{X_1})(X_2 - m_{X_2})]$
• E represents expectation or mean: E(X) = Σᵢ xᵢ P(X = xᵢ)
• If two RVs are uncorrelated, their covariance is 0. If they are independent they are uncorrelated, but not the other way around.
• Independence requires the joint density function to equal the product of the marginal densities (see the sketch below).
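A minimal numpy sketch (the data and sizes are arbitrary, illustrative choices) computing a sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10_000))   # 3 RVs, 10000 samples of each
X[1] += 0.5 * X[0]                     # make the first two RVs correlated

C = np.cov(X)                          # 3x3 sample covariance matrix
print(C)                               # diagonal entries ≈ the variances
```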
• Properties of the covariance matrix / correlation matrix (for the correlation matrix, consider non-mean-subtracted RVs):
Symmetric: C_X = C_Xᵀ
Positive semidefinite: aᵀ C_X a ≥ 0 for every vector a
Eigenvalues are greater than or equal to zero
Eigenvectors are orthogonal
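These properties can be checked numerically; a small self-contained sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
C = np.cov(rng.standard_normal((4, 100)))  # some 4x4 sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)       # eigh is meant for symmetric matrices
print(np.allclose(C, C.T))                 # True: symmetric
print(np.all(eigvals >= -1e-12))           # True: eigenvalues >= 0 (up to round-off)
print(np.allclose(eigvecs.T @ eigvecs, np.eye(4)))  # True: orthonormal eigenvectors
```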
Some points on eigenvalues and eigenvectors
• Consider a square matrix A. Eigenvectors x are those which satisfy T[x] = (a scalar) × x
• i.e., Ax = λx; x is the eigenvector corresponding to the eigenvalue λ
• How do we find λ? Make |A - λI| = 0. Why? (A nonzero x solving (A - λI)x = 0 exists only if A - λI is singular.)
• Eigenvalues of a covariance matrix are always non-negative. Why?
• The question then is: does the inverse of a covariance matrix always exist?
First check: what is the condition for the existence of the inverse of a matrix?
Is the covariance matrix diagonalizable?
Matrix Diagonalization
• One can represent a square matrix as:
A = U Λ U⁻¹, where A, U and Λ are all N×N
This is called diagonalization of the matrix A.
We have N eigenvalues and N eigenvectors, each of size N×1, so U is the eigenvector matrix (eigenvectors as its columns).
Here Λ is a diagonal matrix with the eigenvalues as its diagonal entries.
Can any square matrix be decomposed as above?
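A minimal numpy sketch of the decomposition (the matrix is an illustrative choice; being symmetric, it is guaranteed diagonalizable, which is not true of every square matrix):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # a symmetric, hence diagonalizable, matrix

eigvals, U = np.linalg.eig(A)       # columns of U are the eigenvectors
Lam = np.diag(eigvals)              # Λ: eigenvalues on the diagonal

print(np.allclose(A, U @ Lam @ np.linalg.inv(U)))  # True: A = U Λ U^{-1}
```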
Central Limit Theorem
• To understand this we need to know IID RVs.
• IID: independent and identically distributed
To state the theorem: consider n IID random variables X₁, …, X_n, each with mean μ and variance σ².
Take their sum, i.e., S_n = X₁ + X₂ + … + X_n
• Now the RV S_n has mean nμ and variance nσ²
• Form the normalized RV
Z_n = (S_n - nμ) / (σ√n)
As n → ∞, this random variable has a Gaussian distribution with zero mean and variance 1.
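A quick Monte-Carlo sketch of the theorem (uniform IID RVs and the values of n and the trial count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20_000
mu, sigma = 0.5, np.sqrt(1 / 12)         # mean and std of a Uniform(0, 1) RV

S = rng.random((trials, n)).sum(axis=1)  # S_n, realized over many trials
Z = (S - n * mu) / (sigma * np.sqrt(n))  # normalized RV Z_n

print(Z.mean(), Z.var())                 # ≈ 0 and ≈ 1; Z_n is close to N(0, 1)
```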
Convex Optimization
(You may refer to the book/YouTube lectures by Stephen Boyd)
• In ML we often use optimization techniques to find the required parameters/weights of a function relating the input and output
• Given a function f of the parameters, we minimize or maximize it to search for the optimum parameter values
• Minimization/maximization can be done using optimization techniques.
• Optimization is necessary (to find optimal parameters) in many ML techniques: linear regression, logistic regression and SVM, to name a few.
• In general, finding a global optimum of a function with a number of unknowns (as parameters) is difficult and computationally taxing. This is especially true for DNNs (we end up with a non-convex function).
• But for the class of problems in ML involving convex functions, one can find the global solution, and with low computational complexity at that.
Convex Sets
• A set C is a convex set if for any two points x and y belonging to C and any θ ∈ R with 0 ≤ θ ≤ 1, the convex combination also belongs to C:
θx + (1 - θ)y ∈ C
• Intuitively: if we take any two points in C and draw the line segment between them, every point on that segment also belongs to C
Examples
• All of Rⁿ
• The norm ball: the set {x : ||x|| ≤ 1}
• Affine subspace: the set of all x ∈ Rⁿ such that Ax = b
• The intersection of convex sets is also convex
• The set of positive semidefinite matrices
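A tiny numpy sketch checking the defining property on one of these sets, the unit ball (the sample points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def in_unit_ball(p):
    return np.linalg.norm(p) <= 1 + 1e-12

# Two arbitrary points inside the unit ball {x : ||x|| <= 1}.
x = rng.standard_normal(3)
x /= 2 * np.linalg.norm(x)                 # now ||x|| = 1/2
y = rng.standard_normal(3)
y /= 3 * np.linalg.norm(y)                 # now ||y|| = 1/3

# Every convex combination theta*x + (1 - theta)*y must stay in the ball.
print(all(in_unit_ball(t * x + (1 - t) * y) for t in np.linspace(0, 1, 11)))  # True
```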
Convex function
• A function f : Rⁿ → R is convex if its domain, say D(f), is a convex set and if for all x, y ∈ D(f) and θ ∈ R, 0 ≤ θ ≤ 1,
f(θx + (1 - θ)y) ≤ θf(x) + (1 - θ)f(y)
Intuitively, the value of the function at the combined point is less than or equal to the corresponding value on the straight line (the chord) joining f(x) and f(y).
• We say the function is strictly convex if equality is not allowed.
• If f is convex, -f will be concave.
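A quick numeric check of this inequality for the convex function f(x) = x² (the random test points are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2                 # a convex function of one variable

ok = True
for _ in range(1000):
    x, y, th = rng.normal(), rng.normal(), rng.random()
    # f(th*x + (1-th)*y) <= th*f(x) + (1-th)*f(y) must hold for convex f.
    ok &= f(th * x + (1 - th) * y) <= th * f(x) + (1 - th) * f(y) + 1e-12
print(ok)  # True
```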


First order condition for convexity
• Draw a tangent at any point on the graph. Every point on this tangent line should lie below (or on) the corresponding point of the function, i.e., f(y) ≥ f(x) + ∇f(x)ᵀ(y - x) for all x, y in the domain
Second order condition
• The Hessian must be positive semidefinite. For example, take
f(x₁, x₂) = x₁² + x₂²
$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$
• For positive semidefiniteness we want
$\begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \ge 0$ for all a₁, a₂
• If the Hessian is negative semidefinite, the function is concave.
• For a function of one variable, the condition is that the second derivative is greater than or equal to zero (a numeric sketch follows below).
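A small numpy verification of the positive-semidefiniteness check above (the random test vectors are an illustrative choice):

```python
import numpy as np

# Hessian of f(x1, x2) = x1^2 + x2^2.
H = np.array([[2.0, 0.0],
              [0.0, 2.0]])

rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 2))        # random test vectors a = [a1, a2]
quad = np.einsum('ij,jk,ik->i', a, H, a)  # a^T H a for each test vector
print(np.all(quad >= 0))                  # True: H is positive semidefinite
print(np.linalg.eigvalsh(H))              # [2. 2.]: eigenvalues >= 0
```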
Examples of convex functions
• Exponential function f(x) = e^(ax): the second derivative a²e^(ax) is positive for all x, so convex
• f(x) = -log(x): the second derivative is 1/x² > 0 on the domain x > 0, so convex
• Affine functions f(x) = bᵀx + c: the Hessian is 0, so it is both positive semidefinite and negative semidefinite. Hence affine functions are concave as well as convex.
• Quadratic functions f(x) = ½ xᵀAx + bᵀx + c: the Hessian is A, so convexity/non-convexity is determined by the positive semidefiniteness of the Hessian.
