
Measures of Proximity, Probability Reminder

Outline:
Discretization and Attribute Transformation
Distances
Similarity measures
Probability - Quick Reminder
Discretization Using Class Labels
Entropy-based approach. Let
k be the number of different classes,
d be the number of defined intervals,
m_i be the number of objects in interval i,
m_{ij} be the number of objects of class j in interval i,
p_{ij} = m_{ij} / m_i be the frequency of class j in interval i.

Then the purity of an interval is measured by its entropy:

e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}

The quality of the discretization is measured by the weighted average entropy (m is the total number of objects):

e = \frac{1}{m} \sum_{i=1}^{d} m_i e_i

A typical method is to bisect the interval, trying all possible splitting points, then to bisect the most impure of the obtained intervals; this continues until the desired number of intervals d is reached.
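A minimal Python sketch of these two formulas; the per-interval class counts below are hypothetical (they happen to match the split used in the worked example that follows):

```python
import math

def interval_entropy(class_counts):
    """Entropy e_i of one interval, from its per-class object counts m_ij."""
    m_i = sum(class_counts)
    return -sum((c / m_i) * math.log2(c / m_i)
                for c in class_counts if c > 0)

def discretization_quality(intervals):
    """Weighted average entropy: e = (1/m) * sum_i m_i * e_i."""
    m = sum(sum(counts) for counts in intervals)
    return sum(sum(counts) * interval_entropy(counts)
               for counts in intervals) / m

# Two intervals, two classes: [3 Y, 0 N] and [1 Y, 5 N].
print(discretization_quality([[3, 0], [1, 5]]))   # ~0.4333
```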
Entropy-based Discretization
Suppose we have the following (attribute value, class) pairs. Let S denote the 9 pairs given here:
S = {(0,Y), (4,Y), (12,Y), (16,N), (16,N), (18,Y), (24,N), (26,N), (28,N)}.
Let p_1 = 4/9 be the fraction of pairs with class = Y, and p_2 = 5/9 be the fraction of pairs with class = N.
The entropy (or the information content) of S is defined as:

Entropy(S) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)

In this case Entropy(S) = 0.991076.
If the entropy is small, then the set is relatively pure. The smallest possible value is 0.
If the entropy is larger, then the set is mixed. The largest possible value is 1, which is obtained when p_1 = p_2 = 0.5.
Entropy-Based Discretization (cont.)
Given a set of samples S, if S is partitioned into two intervals S_1 and S_2 using boundary T, the entropy after partitioning is

E(S, T) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2)

The boundary T is chosen from the midpoints of consecutive attribute values, i.e.: {2, 8, 14, 17, 21, 25, 27}.
For instance, if T = 14, then S_1 = {(0,Y), (4,Y), (12,Y)} and
S_2 = {(16,N), (16,N), (18,Y), (24,N), (26,N), (28,N)}, so
E(S, T) = (3/9) Ent(S_1) + (6/9) Ent(S_2) = (3/9) · 0 + (6/9) · 0.650022 = 0.4333
The information gain of the split is
Gain(S, T) = Entropy(S) - E(S, T) = 0.9910 - 0.4333 = 0.5577
Entropy-Based Discretization (cont.)
Let T = 21; then E(S, T) = 0.6121 and the information gain is 0.9910 - 0.6121 = 0.2789. Therefore T = 14 is a better partition.
The goal of this algorithm is to find the split with the maximum information gain. Maximal gain is obtained when E(S, T) is minimal.
The best split(s) are found by examining all possible splits and then selecting the optimal one: the boundary that minimizes the entropy function over all possible boundaries is selected as the binary discretization.
The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., until the gain Entropy(S) - E(S, T) no longer exceeds a threshold.
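A short Python sketch of one level of this search, using the 9 pairs from the example above; it recovers T = 14 as the best boundary:

```python
import math

pairs = [(0, 'Y'), (4, 'Y'), (12, 'Y'), (16, 'N'), (16, 'N'),
         (18, 'Y'), (24, 'N'), (26, 'N'), (28, 'N')]

def entropy(items):
    """Ent(S) over the class labels of the given (value, class) pairs."""
    n = len(items)
    labels = set(c for _, c in items)
    probs = [sum(1 for _, c in items if c == label) / n for label in labels]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def split_entropy(items, t):
    """E(S,T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)."""
    s1 = [x for x in items if x[0] < t]
    s2 = [x for x in items if x[0] >= t]
    n = len(items)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

# Candidate boundaries: midpoints of consecutive distinct values.
values = sorted(set(v for v, _ in pairs))
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]

best = min(midpoints, key=lambda t: split_entropy(pairs, t))
print(best, round(split_entropy(pairs, best), 4))   # 14.0 0.4333
```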
Discretization Using Class Labels
Entropy-based approach.
(Figures: discretization results with 3 categories for both x and y, and with 5 categories for both x and y.)

Discretization Without Using Class Labels
(Figures: original data; equal interval width; equal frequency; K-means.)
Attribute Transformation
A function that maps the
entire set of values of a
given attribute to a new
set of replacement values
such that each old value
can be identified with one
of the new values
Simple functions: x^k, log(x), e^x, |x|
Standardization and Normalization
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple
Attributes
p and q are the attribute values for two data objects.
Euclidean Distance

dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the k-th attributes (components) of data objects p and q.
Standardization is necessary, if scales differ.
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix:
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
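The distance matrix above can be reproduced with a short sketch (for larger data, scipy.spatial.distance.cdist does the same job):

```python
import math

points = {'p1': (0, 2), 'p2': (2, 0), 'p3': (3, 1), 'p4': (5, 1)}

def euclidean(p, q):
    # dist(p, q) = sqrt(sum_k (p_k - q_k)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

for name, p in points.items():
    print(name, [round(euclidean(p, q), 3) for q in points.values()])
# p1 [0.0, 2.828, 3.162, 5.099]
# p2 [2.828, 0.0, 1.414, 3.162] ...
```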
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance:

dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the k-th attributes (components) of data objects p and q.
Minkowski Distance: Examples
r = 1. City block (Manhattan, taxicab, L_1 norm) distance.
A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors.
r = 2. Euclidean distance.
r → ∞. Supremum (L_max norm, L_∞ norm) distance.
This is the maximum difference between any component of the vectors.
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance

point  x  y
p1     3  2
p2     1  2
p3     1  4
p4     5  1

Distance Matrices:

L1    p1  p2  p3  p4
p1    0   2   4   3
p2    2   0   2   5
p3    4   2   0   7
p4    3   5   7   0

L2    p1     p2     p3     p4
p1    0      2      2.828  2.236
p2    2      0      2      4.123
p3    2.828  2      0      5
p4    2.236  4.123  5      0

L∞    p1  p2  p3  p4
p1    0   2   2   2
p2    2   0   2   4
p3    2   2   0   4
p4    2   4   4   0
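All three matrices in one sketch; r = float('inf') is handled separately as the component-wise maximum:

```python
points = [(3, 2), (1, 2), (1, 4), (5, 1)]

def minkowski(p, q, r):
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if r == float('inf'):            # supremum (L_max) norm
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

for r, label in [(1, 'L1'), (2, 'L2'), (float('inf'), 'Linf')]:
    print(label)
    for p in points:
        print([round(minkowski(p, q, r), 3) for q in points])
```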
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrices:

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0
Mahalanobis Distance - Definition

mahalanobis(p, q) = (p - q) \, Q^{-1} \, (p - q)^T

where Q is the covariance matrix of the input data X:

Q_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)

For the red points, the Euclidean distance is 14.7, the Mahalanobis distance is 6.
Mahalanobis Distance - Examples

Covariance matrix: Q = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
Mahalanobis Distance - Computing
A: (0.5, 0.5), B: (0, 1), C: (1.5, 1.5)

Centered data matrix (the mean is (0.66, 1)):

X - \bar{x} = \begin{pmatrix} -0.16 & -0.5 \\ -0.66 & 0 \\ 0.84 & 0.5 \end{pmatrix}

Covariance matrix:

Q = \frac{1}{3-1} (X - \bar{x})^T (X - \bar{x}) = \begin{pmatrix} 0.584 & 0.25 \\ 0.25 & 0.25 \end{pmatrix}

Inverse, with \det(Q) = 0.584 \cdot 0.25 - 0.25^2 = 0.0835:

Q^{-1} = \frac{1}{0.0835} \begin{pmatrix} 0.25 & -0.25 \\ -0.25 & 0.584 \end{pmatrix} \approx \begin{pmatrix} 3 & -3 \\ -3 & 7 \end{pmatrix}

Distance:

Mah(A, B) = \begin{pmatrix} 0.5 & -0.5 \end{pmatrix} \begin{pmatrix} 3 & -3 \\ -3 & 7 \end{pmatrix} \begin{pmatrix} 0.5 \\ -0.5 \end{pmatrix} = 4
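The same computation with NumPy as a check (np.cov with rowvar=False treats columns as attributes and uses the 1/(n-1) normalization from the definition):

```python
import numpy as np

X = np.array([[0.5, 0.5],    # A
              [0.0, 1.0],    # B
              [1.5, 1.5]])   # C

Q = np.cov(X, rowvar=False)      # covariance matrix of the input data
Q_inv = np.linalg.inv(Q)

def mahalanobis(p, q):
    d = p - q
    return d @ Q_inv @ d         # (p - q) Q^{-1} (p - q)^T

print(mahalanobis(X[0], X[1]))   # Mah(A, B) = 4.0 (up to rounding)
```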
Common Properties of a
Distance
Distances, such as the Euclidean distance, have some well-known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
Given a set X, a function f: X × X → R that satisfies these properties is a distance (or a metric). The set X is called a metric space.
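These properties can be spot-checked numerically for any candidate distance function; a small sketch:

```python
import itertools, math

def is_metric_on(points, d, tol=1e-9):
    """Spot-check positive definiteness, symmetry, and the triangle inequality."""
    for p, q in itertools.product(points, repeat=2):
        if d(p, q) < -tol or (d(p, q) > tol) != (p != q):
            return False                      # violates positive definiteness
        if abs(d(p, q) - d(q, p)) > tol:
            return False                      # violates symmetry
    for p, q, r in itertools.product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False                      # violates triangle inequality
    return True

print(is_metric_on([(0, 2), (2, 0), (3, 1), (5, 1)], math.dist))   # True
```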
Common Properties of a
Similarity
Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects) p and q.
Similarity Between Binary
Vectors
Common situation is that objects, p and q, have only binary attributes.
Compute similarities using the following quantities:
M_01 = the number of attributes where p was 0 and q was 1
M_10 = the number of attributes where p was 1 and q was 0
M_00 = the number of attributes where p was 0 and q was 0
M_11 = the number of attributes where p was 1 and q was 1

Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes
    = (M_11 + M_00) / (M_01 + M_10 + M_11 + M_00)
J = number of 11 matches / number of not-both-zero attributes
  = M_11 / (M_01 + M_10 + M_11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M_01 = 2 (the number of attributes where p was 0 and q was 1)
M_10 = 1 (the number of attributes where p was 1 and q was 0)
M_00 = 7 (the number of attributes where p was 0 and q was 0)
M_11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M_11 + M_00) / (M_01 + M_10 + M_11 + M_00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M_11 / (M_01 + M_10 + M_11) = 0 / (2 + 1 + 0) = 0
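Both coefficients for the example vectors, in a minimal sketch:

```python
def smc_and_jaccard(p, q):
    m01 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 1))
    m10 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 0))
    m00 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 0))
    m11 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 1))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jac = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jac

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))   # (0.7, 0.0)
```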
Cosine Similarity
If d_1 and d_2 are two document vectors, then
cos(d_1, d_2) = (d_1 · d_2) / (||d_1|| ||d_2||),
where · indicates the vector dot product and ||d|| is the length of vector d.

Example:
d_1 = 3 2 0 5 0 0 0 2 0 0
d_2 = 1 0 0 0 0 0 0 1 0 2
d_1 · d_2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d_1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^{1/2} = (42)^{1/2} = 6.481
||d_2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^{1/2} = (6)^{1/2} = 2.449
cos(d_1, d_2) = 5 / (6.481 · 2.449) = 0.3150
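A sketch of the same computation:

```python
import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))   # 0.315
```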
Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count attributes; reduces to Jaccard for binary attributes:

T(d_1, d_2) = (d_1 · d_2) / (||d_1||^2 + ||d_2||^2 - d_1 · d_2)

d_1 = 3 2 0 5 0 0 0 2 0 0
d_2 = 1 0 0 0 0 0 0 1 0 2
d_1 · d_2 = 5, ||d_1||^2 = 42, ||d_2||^2 = 6
T(d_1, d_2) = 5 / (42 + 6 - 5) ≈ 0.116
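And as a sketch, including the reduction to Jaccard on binary vectors:

```python
def tanimoto(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = sum(a * a for a in d1)
    n2 = sum(b * b for b in d2)
    return dot / (n1 + n2 - dot)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(tanimoto(d1, d2), 3))   # 0.116

# Binary check: on 0/1 vectors this equals the Jaccard coefficient.
print(tanimoto([1, 0, 1], [1, 1, 0]))   # 1/3
```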
Correlation
Correlation measures the linear relationship between objects.
To compute correlation, we standardize the data objects p and q, and then take their dot product:

p'_k = (p_k - mean(p)) / std(p)
q'_k = (q_k - mean(q)) / std(q)

correlation(p, q) = \frac{1}{n-1} \, p' \cdot q'

where std(x) = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 }
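A sketch matching these formulas; perfectly linear data gives ±1:

```python
import math

def standardize(x):
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return [(v - mean) / std for v in x]

def correlation(p, q):
    ps, qs = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(ps, qs)) / (len(p) - 1)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # ~1.0
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # ~-1.0
```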
Visually Evaluating Correlation
Scatter plots showing the similarity from -1 to 1.
General Approach for Combining
Similarities
Sometimes attributes are of many different types,
but an overall similarity is needed.
Using Weights to Combine Similarities
May not want to treat all attributes the same.
Use weights w_k which are between 0 and 1 and sum to 1.
Density
Density-based clustering requires a notion of density.
Examples:
Euclidean density: number of points per unit volume
Probability density
Graph-based density
Euclidean Density - Cell-based
The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points the cell contains.
Euclidean Density - Center-based
Euclidean density is the number of points within a specified radius of the point.
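Both notions in a small sketch (the grid cell size and the radius are arbitrary illustrative choices):

```python
import math

points = [(0.5, 0.5), (0.6, 0.4), (0.7, 0.6), (3.0, 3.0)]

def cell_density(points, cell_size=1.0):
    """Cell-based: count the points falling in each rectangular grid cell."""
    counts = {}
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] = counts.get(cell, 0) + 1
    return counts

def center_density(points, center, radius=1.0):
    """Center-based: number of points within `radius` of `center`."""
    return sum(1 for p in points if math.dist(p, center) <= radius)

print(cell_density(points))                   # {(0, 0): 3, (3, 3): 1}
print(center_density(points, (0.5, 0.5)))     # 3
```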
Random Variable and Distribution
A sample space S is a nonempty set of possible outcomes of a random experiment.
A random variable X is a function on the sample space that has a numerical range.
The distribution of a random variable is the collection of possible values of the random variable along with their probabilities:

Discrete case: Pr(X = x) = p(x)
Continuous case: Pr(a ≤ X ≤ b) = \int_a^b p(x) \, dx
Random Variable: Example
Let the sample space S be the set of all
sequences of three rolls of a die. Let X be the
sum of the number of dots on the three rolls.
What are the possible values for X?
Pr(X = 5) = ?, Pr(X = 10) = ?
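The sample space here is small enough to enumerate outright; a sketch:

```python
from itertools import product

rolls = list(product(range(1, 7), repeat=3))   # all 6^3 = 216 outcomes
X = [sum(r) for r in rolls]                    # X takes values 3..18

print(sorted(set(X)))              # possible values of X
print(X.count(5) / len(rolls))     # Pr(X = 5)  = 6/216  ~ 0.0278
print(X.count(10) / len(rolls))    # Pr(X = 10) = 27/216 = 0.125
```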
Expectation
A random variable X ~ Pr(X = x). Then its expectation (a.k.a. mean) is

E[X] = \sum_x x \, Pr(X = x)

Continuous case:

E[X] = \int x \, p(x) \, dx

Expectation of a sum of random variables:

E[X_1 + X_2] = E[X_1] + E[X_2]
Expectation: Example
Let S be the set of all sequences of three rolls of a die. Let X be the sum of the number of dots on the three rolls.
What is E[X]?
Let S be the set of all sequences of three rolls of a die. Let X be the product of the number of dots on the three rolls.
What is E[X]?
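Both expectations by enumeration; the first illustrates linearity, the second uses independence:

```python
from itertools import product
from math import prod

rolls = list(product(range(1, 7), repeat=3))

E_sum = sum(sum(r) for r in rolls) / len(rolls)
E_prod = sum(prod(r) for r in rolls) / len(rolls)

print(E_sum)    # 10.5   = 3 * 3.5, by linearity of expectation
print(E_prod)   # 42.875 = 3.5**3, since the rolls are independent
```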
Operating with Expectations
Rule 1: If X is a random variable with mean μ_X, and a and b are fixed constants, then the mean of a + bX is a + bμ_X (i.e., μ_{a+bX} = a + bμ_X).
Rule 2: If X and Y are random variables with means μ_X and μ_Y respectively, then the mean of X + Y is μ_X + μ_Y (i.e., μ_{X+Y} = μ_X + μ_Y).
Variance
The variance of a random variable X is the expectation of (X - E[X])^2:

Var(X) = E((X - E[X])^2)
       = E(X^2 - 2X E[X] + E[X]^2)
       = E(X^2) - 2E[X] E[X] + E[X]^2
       = E[X^2] - E[X]^2

The standard deviation of a random variable X is:

σ(X) = \sqrt{Var(X)} = \sqrt{E[X^2] - E[X]^2}
Variance: Example
Let S be the set of all sequences of two rolls of a die. Let X be the sum of the number of dots on the two rolls.
What is Var(X)?
Let S be the set of all sequences of two rolls of a die. Let Y be the number of times the value of a roll was over 3.
What is Var(Y)?
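Both variances by enumeration, using Var(X) = E[X^2] - E[X]^2:

```python
from itertools import product

rolls = list(product(range(1, 7), repeat=2))

def var(values):
    n = len(values)
    mean = sum(values) / n
    return sum(v * v for v in values) / n - mean ** 2   # E[X^2] - E[X]^2

X = [a + b for a, b in rolls]                # sum of dots
Y = [(a > 3) + (b > 3) for a, b in rolls]    # number of rolls over 3

print(var(X))   # 5.833... = 2 * 35/12
print(var(Y))   # 0.5, matching Binomial(2, 1/2): np(1-p) = 2 * 1/2 * 1/2
```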
Covariance
The covariance of two random variables X and Y is the expectation of the product of (X - E[X]) and (Y - E[Y]):

Cov(X, Y) = E[(X - E[X])(Y - E[Y])]

The correlation of X and Y is the expectation of the product of (X - E[X])/σ(X) and (Y - E[Y])/σ(Y):

ρ_XY = E[(X - E[X])(Y - E[Y])] / (σ(X) σ(Y))
Meaning of Correlation
-1 ≤ ρ_XY ≤ 1
Correlation for random variables has the following interpretation:
ρ_XY > 0 means that when X tends to be larger than its mean, Y will tend to be larger than its mean as well.
ρ_XY < 0 means that when X > E[X], Y tends to be smaller than its mean.
Correlation: Example
Let S be the set of all sequences of two rolls of a die. Let X be the sum of the number of dots on the two rolls, and let Y be the number of times the value of a roll was over 3.
What is the covariance Cov(X, Y)?
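Cov(X, Y) by enumeration over the same 36 outcomes:

```python
from itertools import product

rolls = list(product(range(1, 7), repeat=2))
n = len(rolls)

X = [a + b for a, b in rolls]                # sum of dots
Y = [(a > 3) + (b > 3) for a, b in rolls]    # number of rolls over 3

EX, EY = sum(X) / n, sum(Y) / n
cov = sum((x - EX) * (y - EY) for x, y in zip(X, Y)) / n

print(cov)   # 1.5 (positive: high rolls increase both X and Y)
```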
Operating with Variances
Rule 1: If X is a random variable with variance Var(X), and a and b are fixed constants, then
Var(a + bX) = b^2 Var(X)
Rule 2: If X and Y are random variables with correlation ρ, then
Var(X + Y) = Var(X) + Var(Y) + 2ρ σ(X) σ(Y)
Var(X - Y) = Var(X) + Var(Y) - 2ρ σ(X) σ(Y)
Var(bX - cY) = b^2 Var(X) + c^2 Var(Y) - 2bc ρ σ(X) σ(Y)
If X and Y are not correlated, i.e., ρ = 0, we have
Var(X + Y) = Var(X - Y) = Var(X) + Var(Y)
Var(bX - cY) = b^2 Var(X) + c^2 Var(Y)
Bernoulli Distribution
The outcome of an experiment can either be success (i.e., 1) or failure (i.e., 0).
Pr(X = 1) = p, Pr(X = 0) = 1 - p, or

p(x) = p^x (1 - p)^{1-x}

E[X] = p, Var(X) = p(1 - p)
Binomial Distribution
n draws of a Bernoulli distribution:
X_i ~ Bernoulli(p), X = \sum_{i=1}^{n} X_i, X ~ Bin(n, p)
The random variable X stands for the number of times that the experiments are successful.
E[X] = np, Var(X) = np(1 - p)

Pr(X = x) = p(x) = \binom{n}{x} p^x (1 - p)^{n-x} for x = 0, 1, 2, ..., n, and 0 otherwise
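A sketch of the pmf using math.comb; summing it recovers the mean np:

```python
from math import comb

def binom_pmf(x, n, p):
    # Pr(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x) if 0 <= x <= n else 0.0

n, p = 10, 0.3
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

print(round(sum(pmf), 10))                               # 1.0
print(round(sum(x * v for x, v in enumerate(pmf)), 10))  # E[X] = n*p = 3.0
```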
Plots of Binomial Distribution