Disclaimer: Content taken from Han & Kamber slides, Data mining textbooks and Internet
Dr. Kiran Bhowmick Data Mining: Types of Data
Data Objects and Attribute Types
Patients,
multiple of another
Values are ordered
E.g. – years_of_experience
Number_of_words
Mean
Median
Mode
Midrange
Median
Even no. of observations = average of the middlemost two values
Odd no. of observations = middle observation
For a large number of observations, group the observations into intervals, record the frequency of each
interval, and approximate the median from the median interval:
$$\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
Where
L1 – lower boundary of the median interval,
N – no. of values
(Σ freq)_l – sum of the frequencies of all intervals lower than the median interval,
freq_median – frequency of the median interval,
width – width of the median interval
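The interpolation formula for grouped data can be sketched in a few lines of Python. The function name `grouped_median` and the numbers in the example are hypothetical and are not taken from the slides; they only illustrate how the terms fit together.

```python
def grouped_median(lower_boundary, n, cum_freq_below, freq_median, width):
    """Approximate the median of grouped data by linear interpolation.

    lower_boundary : L1, lower boundary of the median interval
    n              : N, total number of values
    cum_freq_below : sum of frequencies of intervals below the median interval
    freq_median    : frequency of the median interval
    width          : width of the median interval
    """
    return lower_boundary + (n / 2 - cum_freq_below) / freq_median * width


# Hypothetical grouped data: N = 100 values, 40 of them fall below the
# median interval [20, 30), which itself contains 25 values.
example = grouped_median(lower_boundary=20, n=100, cum_freq_below=40,
                         freq_median=25, width=10)
print(example)
```

The result always lies inside the median interval as long as N/2 falls within that interval's cumulative frequency range.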
Midrange
Measures the central tendency of a numeric data set
Average of the largest and smallest values in the set
Given:
60, 61, 61, 61, 62, 62, 63, 63, 63, 63, 63, 64, 64,
64, 64, 64, 64, 64, 65, 65, 66
Median = 63
Mode = 64
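As a quick check of the values above, the four measures of central tendency can be computed with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative, and `multimode` assumes Python 3.8 or later.

```python
from statistics import mean, median, multimode

# The 21 observations given on the slide
data = [60, 61, 61, 61, 62, 62, 63, 63, 63, 63, 63, 64, 64,
        64, 64, 64, 64, 64, 65, 65, 66]

data_mean = mean(data)                        # arithmetic mean
data_median = median(data)                    # middle (11th) value of the sorted data
data_modes = multimode(data)                  # all most frequent values, as a list
data_midrange = (min(data) + max(data)) / 2   # average of smallest and largest

print(data_mean, data_median, data_modes, data_midrange)
```

The computed median and mode agree with the slide (63 and 64); the midrange is (60 + 66) / 2 = 63.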
f_i = frequency of the i-th observation
x_im = midpoint of the i-th x
Median class is the class where the middle point of the total frequency lies.
i.e. 397/2 = 198.5 which lies in 30-40
Hence the median class or median interval is 30-40
$$\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
Element                                        Value
L1 – lower boundary of the median interval     30
Image source: https://www.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/graphboard_creating_examples_boxplot.htm
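Population variance:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
Sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$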
Low standard deviation means data observations tend to be close to the mean
High standard deviation means data are spread out over a large range of values
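The variance and standard deviation can be computed either from the definition or from the algebraically equivalent shortcut form. The sketch below reuses the 21 observations from the central tendency example and checks that both forms agree; the variable names are illustrative only.

```python
import math

data = [60, 61, 61, 61, 62, 62, 63, 63, 63, 63, 63, 64, 64,
        64, 64, 64, 64, 64, 65, 65, 66]
n = len(data)
mean = sum(data) / n

# Population variance: definition and shortcut form
pop_var_def = sum((x - mean) ** 2 for x in data) / n
pop_var_short = sum(x * x for x in data) / n - mean ** 2

# Sample variance: definition and shortcut form (divides by n - 1)
samp_var_def = sum((x - mean) ** 2 for x in data) / (n - 1)
samp_var_short = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

# Standard deviation is the square root of the variance
pop_std = math.sqrt(pop_var_def)
samp_std = math.sqrt(samp_var_def)

print(pop_var_def, samp_var_def, pop_std, samp_std)
```

The shortcut form needs only one pass over the data, but it subtracts two large numbers, so on real data it can lose precision compared with the definition.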
Standardize data
Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$$
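A minimal sketch of standardizing one attribute f with the mean absolute deviation follows. The final step, z = (x - m_f) / s_f, is the usual textbook convention and is assumed here because the slide text does not spell it out; the function name `standardize` and the sample values are hypothetical.

```python
def standardize(values):
    """Return (m_f, s_f, z-scores) for one attribute.

    m_f is the mean, s_f the mean absolute deviation, and each z-score is
    (x - m_f) / s_f (assumed convention, not stated on the slide).
    """
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    z_scores = [(x - m_f) / s_f for x in values]
    return m_f, s_f, z_scores


# Hypothetical attribute values
m_f, s_f, z = standardize([2, 4, 4, 6])
print(m_f, s_f, z)
```

Because deviations are not squared, the mean absolute deviation is less affected by outliers than the standard deviation. The sketch does not guard against s_f = 0, which happens when all values are identical.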
f_i = (i - 0.5) / N
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Allows the user to view whether there is a shift in going from one distribution to another
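The quantile fractions f_i and the point pairs behind a q-q plot can be sketched as below. Both helper names and both samples are hypothetical, and the q-q helper assumes the two samples have the same size so that sorted values can be paired directly.

```python
def quantile_fractions(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N for i = 1..N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


def qq_pairs(sample_a, sample_b):
    """Pair corresponding quantiles of two equal-sized samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(sample_a), sorted(sample_b)))


# Hypothetical samples: sample_b is sample_a shifted up by 5
sample_a = [47, 40, 74, 43]
sample_b = [52, 45, 79, 48]

fractions = quantile_fractions(sample_a)
pairs = qq_pairs(sample_a, sample_b)
print(fractions)
print(pairs)
```

Plotting the pairs against the line y = x would show every point lying 5 units above it, which is the kind of shift between distributions that a q-q plot makes visible.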
$$d(i, j) = \frac{p - m}{p}$$
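$$\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ d(4,1) & d(4,2) & d(4,3) & 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 & 0 \\ 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \end{bmatrix}$$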
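A minimal sketch of the nominal dissimilarity d(i, j) = (p - m) / p, taking p as the total number of attributes and m as the number of attributes on which the two objects match (the usual textbook reading, assumed here). The category labels "A", "B", "C" are hypothetical; they were chosen only because they reproduce the matrix shown on the slide.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects with nominal attributes."""
    p = len(obj_i)                                       # number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)   # number of matches
    return (p - m) / p


# Four objects with a single nominal attribute (hypothetical labels)
objects = [["A"], ["B"], ["C"], ["A"]]

# Lower-triangular dissimilarity matrix, one row per object
matrix = [[nominal_dissimilarity(objects[i], objects[j]) for j in range(i + 1)]
          for i in range(len(objects))]

for row in matrix:
    print(row)
```

Only objects 1 and 4 share a label, so d(4,1) is the only off-diagonal zero.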
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
Distance measure for asymmetric binary variables :
$$d(i, j) = \frac{b + c}{a + b + c}$$
Proximity measure for Binary Variables
$$sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$$
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
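The asymmetric binary distance can be applied to this table as sketched below. Two assumptions are made that the slide text does not state: Y and P are coded as 1 and N as 0, and Gender is treated as a symmetric attribute and left out. The helper names are illustrative.

```python
def encode(row):
    """Map Y/P to 1 and N to 0 (assumed coding)."""
    return [1 if value in ("Y", "P") else 0 for value in row]


def contingency(i, j):
    """Return (a, b, c, d): counts of 1-1, 1-0, 0-1 and 0-0 attribute pairs."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def asymmetric_dissimilarity(i, j):
    a, b, c, _ = contingency(i, j)   # negative matches (d) are ignored
    return (b + c) / (a + b + c)


def jaccard_similarity(i, j):
    a, b, c, _ = contingency(i, j)
    return a / (a + b + c)


# Fever, Cough, Test-1 .. Test-4 for each patient (Gender omitted)
patients = {
    "Jack": encode("YNPNNN"),
    "Mary": encode("YNPNPN"),
    "Jim": encode("YPNNNN"),
}

d_jack_mary = asymmetric_dissimilarity(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_dissimilarity(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_dissimilarity(patients["Jim"], patients["Mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Under these assumptions Jack and Mary come out as the most similar pair and Jim and Mary as the least similar. The Jaccard similarity is always 1 minus the asymmetric distance; neither function guards against two objects with no positive values at all.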
$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p -dimensional
data objects, and q is a positive integer
If q = 1, d is Manhattan distance
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$
Dissimilarity of Numeric Data
If q = 2, d is Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$
Properties
d(i,j) ≥ 0; distance is non-negative
d(i,i) = 0; distance of object to itself is 0
d(i,j) = d(j,i); distance is symmetric
d(i,j) ≤ d(i,k) + d(k,j); triangle inequality
$$d(i, j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \dots + w_p |x_{ip} - x_{jp}|^2}$$
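The Minkowski family of distances, including the Manhattan (q = 1), Euclidean (q = 2), and weighted Euclidean variants, can be sketched as follows. The points and weights are hypothetical, and the function names are illustrative.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)


def weighted_euclidean(i, j, weights):
    """Euclidean distance with one weight per attribute."""
    return sum(w * abs(a - b) ** 2 for w, a, b in zip(weights, i, j)) ** 0.5


# Hypothetical 2-dimensional objects
x1 = (1, 2)
x2 = (3, 5)
x3 = (2, 0)

manhattan = minkowski(x1, x2, 1)    # |1 - 3| + |2 - 5|
euclidean = minkowski(x1, x2, 2)    # square root of (2^2 + 3^2)
weighted = weighted_euclidean(x1, x2, weights=(1, 1))

print(manhattan, euclidean, weighted)
```

With all weights equal to 1 the weighted form reduces to the plain Euclidean distance, and the three points can be used to spot-check the symmetry and triangle-inequality properties.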
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
Ob ID   Rank   z_if
1       3      1
2       1      0
3       2      0.5
4       3      1
$$\begin{bmatrix} 0 \\ 1 & 0 \\ 0.5 & 0.5 & 0 \\ 0 & 1 & 0.5 & 0 \end{bmatrix}$$
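The ordinal example can be reproduced with a short sketch: ranks are mapped onto [0, 1] with z = (r - 1) / (M - 1), and the dissimilarity between two objects is the absolute difference of their z values. M = 3 is inferred from the table (rank 3 maps to z = 1); the variable names are illustrative.

```python
ranks = [3, 1, 2, 3]   # ranks of objects 1..4 from the table
num_states = 3         # M_f: number of ordered states (inferred from the table)

# Normalize each rank onto the interval [0, 1]
z = [(r - 1) / (num_states - 1) for r in ranks]

# Lower-triangular dissimilarity matrix of absolute z differences
matrix = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]

print(z)
for row in matrix:
    print(row)
```

After normalization the ordinal attribute is handled exactly like a numeric one, which is why a plain absolute difference suffices here.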
Ob ID   Log(x_if)
1       2.65
2       1.34
3       2.21
4       3.08
$$\begin{bmatrix} 0 \\ 1.31 & 0 \\ 0.44 & 0.87 & 0 \\ 0.43 & 1.74 & 0.87 & 0 \end{bmatrix}$$
Max = 64, min = 22
d(2,1) = |22 - 45| / (64 - 22) = 0.55
$$\begin{bmatrix} 0 \\ 0.55 & 0 \\ 0.45 & 1.00 & 0 \\ 0.40 & 0.14 & 0.86 & 0 \end{bmatrix}$$
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
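Dissimilarity matrices for the three attributes:
$$\begin{bmatrix} 0 \\ 1 & 0 \\ 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \end{bmatrix} \quad \begin{bmatrix} 0 \\ 1 & 0 \\ 0.5 & 0.5 & 0 \\ 0 & 1 & 0.5 & 0 \end{bmatrix} \quad \begin{bmatrix} 0 \\ 0.55 & 0 \\ 0.45 & 1.00 & 0 \\ 0.40 & 0.14 & 0.86 & 0 \end{bmatrix}$$
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
d(3,1) = (1(1) + 1(0.50) + 1(0.45)) / 3 = 0.65
Resulting dissimilarity matrix:
$$\begin{bmatrix} 0 \\ 0.85 & 0 \\ 0.65 & 0.83 & 0 \\ 0.13 & 0.71 & 0.79 & 0 \end{bmatrix}$$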
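The combination step for mixed attribute types can be sketched as below, using the three per-attribute matrices from the slide. Labelling them nominal, ordinal, and numeric is an interpretation based on the earlier examples, and every indicator δ is taken to be 1 (no missing values), which matches the worked value 0.65. The function name `combine` is illustrative.

```python
# Per-attribute lower-triangular dissimilarity matrices from the slide
nominal = [[0], [1, 0], [1, 1, 0], [0, 1, 1, 0]]
ordinal = [[0], [1, 0], [0.5, 0.5, 0], [0, 1, 0.5, 0]]
numeric = [[0], [0.55, 0], [0.45, 1.00, 0], [0.40, 0.14, 0.86, 0]]


def combine(matrices):
    """Average the per-attribute dissimilarities, with every delta set to 1."""
    n = len(matrices[0])
    result = []
    for i in range(n):
        row = []
        for j in range(i + 1):
            deltas = [1] * len(matrices)   # assumption: no missing values
            numerator = sum(d * m[i][j] for d, m in zip(deltas, matrices))
            row.append(numerator / sum(deltas))
        result.append(row)
    return result


combined = combine([nominal, ordinal, numeric])
rounded = [[round(value, 2) for value in row] for row in combined]

for row in rounded:
    print(row)
```

Rounded to two decimals, the output matches the resulting matrix on the slide. To handle missing values, the corresponding δ would be set to 0 so that the attribute drops out of both the numerator and the denominator.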
Cosine similarity
Defined as
$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
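Cosine similarity is conventionally the dot product of two vectors divided by the product of their Euclidean norms. A minimal sketch under that standard definition follows; the two term-frequency vectors are hypothetical and the function name is illustrative.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical term-frequency vectors for two documents
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 2))
```

A value near 1 means the vectors point in almost the same direction regardless of their lengths, and 0 means they share no non-zero attributes. The sketch does not handle an all-zero vector, whose norm would make the denominator zero.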
Visualization