Sample variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
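The two forms of the sample variance are algebraically equal, which a short numerical check can confirm. The sketch below is illustrative only: the function names and the data values are assumptions, not part of the slides.

```python
def variance_deviation_form(xs):
    """Sample variance as the sum of squared deviations from the mean, over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_sum_form(xs):
    """Sample variance as [sum(x^2) - (sum(x))^2 / n], over n - 1."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


# Hypothetical data, chosen only for illustration.
data = [2.0, 4.0, 4.0, 5.0, 7.0]
v1 = variance_deviation_form(data)
v2 = variance_sum_form(data)
```

The second form needs only running totals of x and x², so it can be computed in a single pass, although it can lose precision when the mean is large relative to the spread.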
Properties of Normal Distribution Curve
Graphic Displays of Basic Statistical Descriptions
• Scatter plot: each pair of values is treated as a pair of coordinates and
plotted as a point in the plane
Histogram Analysis
• Histogram: graphical display of tabulated
frequencies, shown as bars
Histogram example: uneven width
Unit price ($)        40   43   47  …   74   75   78  …  115  117  120
Count of items sold  275  300  250  …  360  515  540  …  320  270  350
[Figure: histogram with uneven bin widths; vertical axis: count of items sold / width]
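When bins have uneven widths, plotting count / width on the vertical axis keeps each bar's area proportional to its count. The sketch below shows that calculation. The bin edges and counts are hypothetical and are not taken from the table above, whose elided entries are unknown.

```python
def density_heights(edges, counts):
    """Return bar heights count / width for bins defined by consecutive edges."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more edge than counts")
    heights = []
    for left, right, count in zip(edges, edges[1:], counts):
        width = right - left
        heights.append(count / width)
    return heights


# Hypothetical bins: two of width 20 and one of width 40.
edges = [40, 60, 80, 120]
counts = [2000, 3000, 4000]
heights = density_heights(edges, counts)

# Area of each bar (height * width) recovers the original count.
areas = [h * (r - l) for h, l, r in zip(heights, edges, edges[1:])]
```

In this example the widest bin holds the most items, yet its bar is no taller than the first one, because its count is spread over twice the width.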
Histograms Often Tell More than Boxplots
[Figure: boxplot with quartiles Q1, Q2, Q3 marked]
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity Matrix
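• Data matrix
• n data points with p dimensions
• Two modes
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
• Dissimilarity matrix
• n data points, but registers only the pairwise distances
• A triangular matrix
• Single mode
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$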
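The relation between the two structures can be sketched in a few lines: the data matrix stores n rows of p values, and the dissimilarity matrix stores one distance per pair, so only the lower triangle is needed. The function names, the choice of Manhattan distance, and the data values below are illustrative assumptions.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (one possible distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dissimilarity_matrix(data, dist):
    """Return the lower triangle, row i holding d(i, 0) .. d(i, i)."""
    return [[dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]


# Hypothetical 3 x 2 data matrix (n = 3 objects, p = 2 attributes).
data = [[1, 2], [4, 6], [1, 0]]
tri = dissimilarity_matrix(data, manhattan)
```

Because d(i, j) = d(j, i) and d(i, i) = 0, the upper triangle carries no extra information and is omitted.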
Proximity Measure for Nominal Attributes
$$d(1,2) = \frac{3-1}{3} = 0.67$$
$$d(1,3) = \frac{3-2}{3} = 0.33$$
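Both values have the form (p − m)/p with p = 3, which is consistent with simple matching, where p is the number of nominal attributes and m is the number on which the two objects agree. That reading is an assumption, since the defining formula is not part of this extract. The sketch below follows it with hypothetical objects and attribute values.

```python
def nominal_dissimilarity(a, b):
    """Simple matching: fraction of attributes on which a and b differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matches
    return (p - m) / p


# Hypothetical objects described by three nominal attributes.
o1 = ("red", "small", "round")
o2 = ("red", "large", "square")   # matches o1 on 1 of 3 attributes
o3 = ("red", "small", "square")   # matches o1 on 2 of 3 attributes

d12 = nominal_dissimilarity(o1, o2)
d13 = nominal_dissimilarity(o1, o3)
```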
Distance on Numeric Data: Minkowski Distance
$$d(i,j) = \left(|x_{i1}-x_{j1}|^h + |x_{i2}-x_{j2}|^h + \cdots + |x_{ip}-x_{jp}|^h\right)^{1/h}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional
data objects, and h is the order (the distance so defined is also called
the L_h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
Metric
• Claim:
• Minkowski distance is a metric for any h ≥ 1
• The distance defined for nominal attributes is a
metric
• The distance defined for ordinal attributes is a
metric (later)
• The distance defined for mixed-type attributes is a
metric (later)
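The triangle inequality can be probed numerically. The sketch below searches a small, hypothetical set of points for triples that violate it. It finds none for h = 1 and h = 2 and does find one for h = 0.5, which is why the metric claim is usually stated for h ≥ 1. A check on three points illustrates the claim but does not prove it.

```python
from itertools import permutations


def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length vectors."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1.0 / h)


def triangle_violations(points, h, tol=1e-12):
    """Return index triples (i, j, k) with d(i, j) > d(i, k) + d(k, j)."""
    bad = []
    for i, j, k in permutations(range(len(points)), 3):
        direct = minkowski(points[i], points[j], h)
        detour = minkowski(points[i], points[k], h) + minkowski(points[k], points[j], h)
        if direct > detour + tol:
            bad.append((i, j, k))
    return bad


# Hypothetical points: two unit steps along the axes.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]

violations_h1 = triangle_violations(points, 1)
violations_h2 = triangle_violations(points, 2)
violations_half = triangle_violations(points, 0.5)
```

With h = 0.5 the direct distance from (0, 0) to (1, 1) is (1 + 1)² = 4, while the detour through (1, 0) costs only 1 + 1 = 2.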
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are different between two binary vectors
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
• h = 2: Euclidean (L2 norm) distance
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$$
Example: Minkowski Distance
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Dissimilarity Matrices

Manhattan (L1)
L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0

Euclidean (L2)
L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Supremum (L∞)
L∞   x1    x2    x3    x4
x1   0
x2   3     0
x3   2     5     0
x4   3     1     5     0
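The three matrices can be reproduced from the four points with a few lines of code. In this sketch the supremum distance is treated as the limit h → ∞, that is, the largest coordinate difference. The function names and the dictionary layout are illustrative choices.

```python
import math

# The four points from the example.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


def lower_triangle(points, h):
    """Map (row, column) name pairs below the diagonal to rounded distances."""
    names = list(points)
    return {
        (names[i], names[j]): round(minkowski(points[names[i]], points[names[j]], h), 2)
        for i in range(len(names))
        for j in range(i)
    }


manhattan = lower_triangle(points, 1)
euclidean = lower_triangle(points, 2)
supremum = lower_triangle(points, math.inf)
```

For instance, `euclidean[("x2", "x1")]` is √(2² + 3²) = √13, which rounds to 3.61.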
Ordinal Variables
• An ordinal variable has an order on its values
• Based on the order, we can assign a rank to each value
• The ranks can then be treated like interval-scaled values
• For the ith object and fth attribute, replace x_{if} by its rank
$$r_{if} \in \{1, \ldots, M_f\}$$
Ordinal Variables: example
$$r_{if} \in \{1, \ldots, M_f\}, \qquad z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
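The mapping from rank to z places every ordinal value in [0, 1], so attributes with different numbers of levels become comparable. The sketch below applies it to a hypothetical three-level attribute. The level names and the function name are assumptions made for illustration.

```python
def ordinal_to_unit_interval(values, ordered_levels):
    """Replace each value by z = (r - 1) / (M - 1), where r is its 1-based rank."""
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal attribute with M = 3 levels, lowest first.
levels = ["fair", "good", "excellent"]
observed = ["excellent", "fair", "good", "excellent"]
z = ordinal_to_unit_interval(observed, levels)
```

Once the values are mapped, any numeric distance can be applied to z, for example |z_i − z_j|.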
Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, numeric, ordinal
• One may use a weighted formula to combine their effects
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{f}\, d_{ij}^{f}}{\sum_{f=1}^{p} \delta_{ij}^{f}}$$
• $\delta_{ij}^{f}$ is the indicator, and $d_{ij}^{f}$ is the contribution, of attribute f
to the distance between objects i and j
• $\delta_{ij}^{f} = 0$ if one of $x_{if}$ and $x_{jf}$ is missing, $\delta_{ij}^{f} = 1$ otherwise
• f is nominal: $d_{ij}^{f} = 0$ if $x_{if} = x_{jf}$, $d_{ij}^{f} = 1$ otherwise
• f is numeric: use the normalized distance
• f is ordinal: compute ranks $r_{if}$ and then $z_{if}$, then treat $z_{if}$ as numeric
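A minimal sketch of the weighted formula follows. It makes several assumptions beyond the slide: missing values are marked with `None`, the normalized numeric distance is taken to be |x − y| divided by the attribute's range (max − min), and ordinal attributes are expected to be mapped to z in [0, 1] beforehand and passed as numeric with range 1. All names and data values are hypothetical.

```python
def mixed_distance(a, b, kinds, ranges):
    """Average the per-attribute contributions over attributes present in both objects.

    kinds[f] is "nominal" or "numeric"; ranges[f] is max - min for numeric
    attributes and is ignored for nominal ones.
    """
    numerator = 0.0
    denominator = 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # indicator delta = 0: attribute is skipped
        if kind == "nominal":
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            d = abs(x - y) / rng  # assumed normalization by the attribute range
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1  # indicator delta = 1
    if denominator == 0:
        raise ValueError("no attribute is present in both objects")
    return numerator / denominator


# Hypothetical objects: (nominal code, ordinal already mapped to z, numeric value).
kinds = ("nominal", "numeric", "numeric")
ranges = (None, 1.0, 42.0)
obj_a = ("A", 1.0, 45.0)
obj_b = ("B", 0.5, 22.0)
obj_c = ("A", None, 64.0)  # second attribute missing

d_ab = mixed_distance(obj_a, obj_b, kinds, ranges)
d_ac = mixed_distance(obj_a, obj_c, kinds, ranges)
```

For `obj_a` and `obj_c` the missing attribute drops out of both sums, so the result is the average over the two remaining attributes.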
Mixed attribute types: example
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{f}\, d_{ij}^{f}}{\sum_{f=1}^{p} \delta_{ij}^{f}}$$
Cosine Similarity
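The directions of two vectors can be compared by the angle α between them
$$\mathrm{sim}(O_1, O_2) = \cos\alpha = \frac{O_1 \cdot O_2}{\|O_1\|\,\|O_2\|}$$
where $O_1 = (x_1, \ldots, x_p)$, $O_2 = (y_1, \ldots, y_p)$, and
$$\|O_1\| = \sqrt{\sum_{i=1}^{p} x_i^2}, \qquad \|O_2\| = \sqrt{\sum_{i=1}^{p} y_i^2}$$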
Cosine Similarity: example
• A document can be represented by thousands of
attributes, each recording the frequency of a particular
word (such as keywords) or phrase in the document.
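For term-frequency vectors like these, cosine similarity compares the mix of words rather than document length. The sketch below uses three hypothetical four-term vectors. The function name and all counts are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical term-frequency vectors over the same four terms.
doc1 = [2, 0, 1, 0]
doc2 = [4, 0, 2, 0]   # same word mix as doc1, twice as long
doc3 = [0, 3, 0, 1]   # shares no terms with doc1

sim_12 = cosine_similarity(doc1, doc2)
sim_13 = cosine_similarity(doc1, doc3)
```

Here `doc2` is `doc1` scaled by two, so the angle between them is zero and the similarity is 1 (up to floating-point rounding), while `doc1` and `doc3` share no terms and have similarity 0.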