
Data Sets (2)

Measuring the Dispersion of Data


• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

N: population size; n: sample size


• Standard deviation s (or σ) is the square root of variance s² (or σ²)

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2}$$

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]}$$
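The algebraic (one-pass) form is what makes the computation scalable: only a running sum and sum of squares are needed. A minimal sketch in Python (the function name is illustrative):

```python
import math

def sample_variance(xs):
    """Sample variance via the one-pass algebraic form:
    s^2 = (sum(x_i^2) - (sum(x_i))^2 / n) / (n - 1)."""
    n = len(xs)
    s, sq = 0.0, 0.0
    for x in xs:          # single pass: accumulate sum and sum of squares
        s += x
        sq += x * x
    return (sq - s * s / n) / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
var = sample_variance(xs)
print(var, math.sqrt(var))  # standard deviation is the square root
```

Note that in floating point the two-pass form (subtracting the mean first) is numerically more stable; the one-pass form trades some accuracy for scalability.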
Properties of Normal Distribution Curve

• The normal (distribution) curve


• From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation)
• From μ–2σ to μ+2σ: contains about 95% of the measurements
• From μ–3σ to μ+3σ: contains about 99.7% of the measurements
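These coverages follow from the standard normal CDF. A quick numerical check, assuming scipy is available:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"mu ± {k}σ: {p:.4f}")  # 0.6827, 0.9545, 0.9973
```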

Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary

• Histogram: x-axis represents values, y-axis represents frequencies (or frequencies per unit)

• Scatter plot: each pair of values is a pair of coordinates and plotted as points in
the plane

Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

[Figure: histogram with frequencies (0–40) on the y-axis over values 10000–90000 on the x-axis]
Histogram example: uneven width
Unit price ($):        40   43   47   …   74   75   78   …  115  117  120
Count of items sold:  275  300  250   …  360  515  540   …  320  270  350

[Figure: histogram with bar height = count of items sold / width; intervals 40–59, 60–99, and 100–120 with total counts 9000, 4350, and 2900; x-axis: unit price]
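With uneven widths, each bar's height must be count/width so that its area encodes the count. A sketch of this figure with matplotlib, using the interval totals shown:

```python
import matplotlib.pyplot as plt

# Interval edges and total items sold per interval (from the example)
edges  = [40, 60, 100, 121]          # bins: 40–59, 60–99, 100–120
counts = [9000, 4350, 2900]

widths  = [edges[i + 1] - edges[i] for i in range(len(counts))]
heights = [c / w for c, w in zip(counts, widths)]  # area of bar = count

plt.bar(edges[:-1], heights, width=widths, align='edge', edgecolor='black')
plt.xlabel('unit price ($)')
plt.ylabel('count of items sold / width')
plt.show()
```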
Histogram example: even width
Unit price ($):        40   43   47   …   74   75   78   …  115  117  120
Count of items sold:  275  300  250   …  360  515  540   …  320  270  350

Histograms Often Tell More than Boxplots

◼ The two histograms shown on the left may have the same boxplot representation
  ◼ The same values for: min, Q1, median, Q3, max
◼ But they have rather different data distributions

[Figure: two different histograms, each marked with Q1, Q2 (median), Q3]
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data

• The left half fragment is positively correlated
• The right half is negatively correlated

Uncorrelated Data

Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
• Data matrix
  • n data points with p dimensions
  • Two modes

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

• Dissimilarity matrix
  • n data points, but registers only the distance
  • A triangular matrix
  • Single mode

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
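A minimal sketch of both structures, assuming numpy and scipy are available:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 5 points, p = 3 dimensions ("two modes": rows and columns differ)
X = np.random.rand(5, 3)

# pdist returns the n(n-1)/2 pairwise distances (the lower triangle, "single mode");
# squareform expands them into the full symmetric dissimilarity matrix
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 3))
```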

Proximity Measure for Nominal Attributes

• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
• Simple matching
• m: # of matches, p: total # of variables
$$d(i,j) = \frac{p - m}{p}$$

• Note: for each attribute, we assume all its values are equally important
  • Otherwise, higher weights should be given to ‘more important’ values (detail omitted)
Distance measure for nominal attributes
• Consider the following data set
Person id   Gender   Language   Hair color
1           M        English    brown
2           F        English    black
3           M        Spanish    brown

 d(1,2) = (3 − 1) / 3 = 0.67
 d(1,3) = (3 − 2) / 3 = 0.33
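A short sketch of simple matching over these rows (attribute tuples taken from the table):

```python
def nominal_distance(a, b):
    """d(i, j) = (p - m) / p for vectors of nominal attributes."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matching attributes
    return (p - m) / p

persons = {1: ('M', 'English', 'brown'),
           2: ('F', 'English', 'black'),
           3: ('M', 'Spanish', 'brown')}
print(round(nominal_distance(persons[1], persons[2]), 2))  # 0.67
print(round(nominal_distance(persons[1], persons[3]), 2))  # 0.33
```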

Distance on Numeric Data: Minkowski Distance

• Minkowski distance: a general distance measure

$$d(i,j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h}$$

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric

Metric
• Claim:
• Minkowski distance is a metric for any h ≥ 1
• The distance defined for nominal attributes is a
metric
• The distance defined for ordinal attributes is a
metric (later)
• The distance defined for mixed-type attributes is a metric (later)

18
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
  • E.g., the Hamming distance: the number of bits that are different between two binary vectors

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

• h = 2: Euclidean (L2 norm) distance

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

• h → ∞: “supremum” (Lmax norm, L∞ norm) distance

$$d(i,j) = \max_{f=1}^{p} |x_{if} - x_{jf}|$$

  • This is the maximum difference between any component (attribute) of the vectors
Example: Minkowski Distance
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Dissimilarity Matrices

Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0

Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.10   0
x4    4.24   1      5.39   0

Supremum (L∞)
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
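These three matrices can be reproduced with scipy's built-in special cases of Minkowski distance (a sketch, assuming scipy is available):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])  # points x1..x4

# h = 1, h = 2, and h -> infinity special cases of the Minkowski distance
for name in ('cityblock', 'euclidean', 'chebyshev'):
    print(name)
    print(np.round(squareform(pdist(X, metric=name)), 2))
```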
Ordinal Variables
• An ordinal variable has an order on its values
• Based on the order, we can assign a rank to each value
• Then we can treat it like an interval-scaled variable
  • For the ith object and fth attribute, replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  • Map the rank of each variable onto [0, 1] by replacing $r_{if}$ with
  $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
  • Compute the dissimilarity using methods for interval-scaled variables

Ordinal Variables: example

(using $r_{if} \in \{1, \ldots, M_f\}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$)

 Consider the following data set:

day   temperature   rank   mapping
1     very warm     4      1
2     cold          2      0.33
3     warm          3      0.66
4     very cold     1      0

 Rank the four values:
   very cold: 1
   cold: 2
   warm: 3
   very warm: 4
 Map to the [0, 1] interval:
   Very cold: (1−1)/(4−1) = 0
   Cold: (2−1)/(4−1) = 0.33
   Warm: (3−1)/(4−1) = 0.66
   Very warm: (4−1)/(4−1) = 1
 Dissimilarity matrix:

0
0.67   0
0.34   0.33   0
1      0.33   0.66   0
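A sketch reproducing this matrix from the ranks (plain Python; results match the matrix above up to the rounding of the mapped values):

```python
ranks = {'very cold': 1, 'cold': 2, 'warm': 3, 'very warm': 4}
M = len(ranks)

def z(value):
    """Map a rank onto [0, 1]: z = (r - 1) / (M - 1)."""
    return (ranks[value] - 1) / (M - 1)

days = ['very warm', 'cold', 'warm', 'very cold']  # rows of the data set
for i, a in enumerate(days):
    # |z_i - z_j| for the lower triangle of the dissimilarity matrix
    row = [round(abs(z(a) - z(b)), 2) for b in days[:i + 1]]
    print(row)
```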

Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, numeric, ordinal
• One may use a weighted formula to combine their effects
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

• $\delta_{ij}^{(f)}$ is the indicator, and $d_{ij}^{(f)}$ is the contribution, of attribute f to the distance between objects i and j
• $\delta_{ij}^{(f)} = 0$ if one of $x_{if}$ and $x_{jf}$ is missing; $\delta_{ij}^{(f)} = 1$ otherwise
• f is nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; $d_{ij}^{(f)} = 1$ otherwise
• f is numeric: use the normalized distance
• f is ordinal: compute ranks $r_{if}$ and then $z_{if}$, then treat $z_{if}$ as numeric
Mixed attribute types: example

(using $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$)

         nominal   ordinal      nominal   numeric
Car id   color     age (year)   model     price ($)
1        black     > 10         Honda     22,000
2        red       5 – 10       Honda     30,000
3        grey      > 10         Buick     40,000
4        red       < 5          Ford      25,000

 $d(1,2) = \frac{\delta_{12}^{color} d_{12}^{color} + \delta_{12}^{age} d_{12}^{age} + \delta_{12}^{model} d_{12}^{model} + \delta_{12}^{price} d_{12}^{price}}{\delta_{12}^{color} + \delta_{12}^{age} + \delta_{12}^{model} + \delta_{12}^{price}}$
 $\delta_{12}^{color} = \delta_{12}^{age} = \delta_{12}^{model} = \delta_{12}^{price} = 1$ (no values are missing)
 $d_{12}^{color} = 1$, $d_{12}^{model} = 0$
 For age:
   rank the values for age: ‘< 5’ → 1, ‘5 – 10’ → 2, ‘> 10’ → 3
   normalize to [0, 1]: 1 → 0, 2 → 0.5, 3 → 1
   $d_{12}^{age} = |1 - 0.5| = 0.5$
 $d_{12}^{price} = \frac{30000 - 22000}{40000 - 22000} = 0.44$
 $d(1,2) = \frac{1 \times 1 + 1 \times 0.5 + 1 \times 0 + 1 \times 0.44}{4} = 0.485$
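A sketch of the same weighted combination for cars 1 and 2 (helper names and the dict layout are illustrative):

```python
age_rank = {'< 5': 1, '5 - 10': 2, '> 10': 3}   # ordinal ranks for the age attribute

def mixed_distance(a, b, price_range):
    """Weighted mixed-type dissimilarity; each attribute contributes in [0, 1].
    All indicator deltas are 1 here since no values are missing."""
    d_color = 0.0 if a['color'] == b['color'] else 1.0           # nominal
    d_model = 0.0 if a['model'] == b['model'] else 1.0           # nominal
    z = lambda r: (r - 1) / (len(age_rank) - 1)                  # ordinal -> [0, 1]
    d_age = abs(z(age_rank[a['age']]) - z(age_rank[b['age']]))
    d_price = abs(a['price'] - b['price']) / price_range         # normalized numeric
    return (d_color + d_age + d_model + d_price) / 4

car1 = {'color': 'black', 'age': '> 10', 'model': 'Honda', 'price': 22000}
car2 = {'color': 'red', 'age': '5 - 10', 'model': 'Honda', 'price': 30000}
print(round(mixed_distance(car1, car2, price_range=40000 - 22000), 3))  # ~0.485
```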
Cosine Similarity
• Objects viewed as vectors
• Similarity measures emphasize direction

[Figure: two pairs of vectors; the pair O1, O2 with the smaller angle between them is more similar]
Cosine Similarity
 Directions of vectors can be measured by the angle between them

$$\text{sim}(O_1, O_2) = \cos\alpha = \frac{O_1 \cdot O_2}{\|O_1\| \, \|O_2\|}$$

where $\cdot$ indicates the vector dot product and $\|O\|$ is the length of vector O

[Figure: vectors O1 and O2 separated by angle α]

 Let $O_1 = (x_1, \cdots, x_p)$ and $O_2 = (y_1, \cdots, y_p)$; then
   $O_1 \cdot O_2 = \sum_{i=1}^{p} x_i y_i$
   $\|O_1\| = \sqrt{\sum_{i=1}^{p} x_i^2}$
   $\|O_2\| = \sqrt{\sum_{i=1}^{p} y_i^2}$

Cosine Similarity: example
• A document can be represented by thousands of
attributes, each recording the frequency of a particular
word (such as keywords) or phrase in the document.

 Other vector objects: gene features in micro-arrays, …


 Applications: information retrieval, biological taxonomy, gene feature mapping, ...

Cosine Similarity: example

• Find the similarity between documents d1 and d2, where
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
• Sim(d1, d2) = cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖)
  d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
  ‖d1‖ = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
  ‖d2‖ = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 = 4.123
  sim(d1, d2) = 25 / (6.481 × 4.123) = 0.94
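The same computation as a sketch in plain Python:

```python
import math

def cosine_similarity(a, b):
    """sim(a, b) = (a · b) / (||a|| ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))          # vector dot product
    norm_a = math.sqrt(sum(x * x for x in a))       # ||a||
    norm_b = math.sqrt(sum(y * y for y in b))       # ||b||
    return dot / (norm_a * norm_b)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```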
