
Data Mining:

Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets
 Record
  Relational records
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents represented as term-frequency vectors, e.g.:

               team  coach  play  ball  score  game  win  lost  timeout  season
   Document 1    3     0     5     0      2      6    0     2      0        2
   Document 2    0     7     0     2      1      0    0     3      0        0
   Document 3    0     1     0     0      1      2    2     0      3        0

  Transaction data, e.g.:

   TID  Items
   1    Bread, Coke, Milk
   2    Beer, Bread
   3    Beer, Coke, Diaper, Milk
   4    Beer, Bread, Diaper, Milk
   5    Coke, Diaper, Milk

 Graph and network
  World Wide Web
  Social or information networks
  Molecular structures
 Ordered
  Video data: sequence of images
  Temporal data: time-series
  Sequential data: transaction sequences
  Genetic sequence data
 Spatial, image and multimedia
  Spatial data: maps
  Image data
  Video data
Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects,
tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Attributes

 Attribute (or dimension, feature, variable): a data field
representing a characteristic or feature of a data object.
  E.g., customer_ID, name, address
 Types:
  Nominal
  Binary
  Ordinal
  Numeric: quantitative
   Interval-scaled
   Ratio-scaled
Attribute Types
 Nominal (categorical): categories, states, or “names of things”
  Hair_color = {black, blond, brown, grey, red, white}
  marital status, occupation, ID numbers, zip codes
  Each value represents some kind of category, code, or state
  The values do not have any meaningful order
  In computer science, the values are also known as enumerations
 Because nominal attribute values do not have any meaningful
order about them and are not quantitative, it makes no sense to
find the mean (average) value or median (middle) value for such
an attribute, given a set of objects.
 One thing that is of interest, however, is the attribute’s most
commonly occurring value. This value, known as the mode, is a
measure of central tendency that can be computed for a nominal
attribute.
Attribute Types
 Binary
 Nominal attribute with only 2 states (0 and 1)
 0 typically means that the attribute is absent, and 1 means that
it is present
 Binary attributes are referred to as Boolean if the two states
correspond to true and false
 E.g., given the attribute smoker describing a patient object, 1
indicates that the patient smokes, while 0 indicates that the
patient does not
 E.g., a patient undergoes a medical test that has two possible
outcomes: the attribute medical test is binary, where a value of
1 means the result of the test for the patient is positive, while
0 means the result is negative
Attribute Types
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings
 The values have a meaningful sequence (which corresponds to
increasing drink size); however, we cannot tell from the values
how much bigger, say, a large is than a medium.
Ordinal Attribute
 Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured
objectively;
 thus ordinal attributes are often used in surveys for
ratings.
 In one survey, participants were asked to rate how
satisfied they were as customers.
 Customer satisfaction had the following ordinal
categories: 0: very dissatisfied, 1: somewhat dissatisfied,
2: neutral, 3: satisfied, and 4: very satisfied.
 The central tendency of an ordinal attribute can be
represented by its mode and its median (the middle value
in an ordered sequence), but the mean cannot be
defined.
Ordinal Attribute
 nominal, binary, and ordinal attributes are
qualitative
 they describe a feature of an object without
giving an actual size or quantity.
 The values of such qualitative attributes are
typically words representing categories.
 If integers are used, they represent computer
codes for the categories, as opposed to
measurable quantities
 (e.g., 0 for small drink size, 1 for medium, and

2 for large)
Numeric Attribute Types
 A numeric attribute is quantitative
 it is a measurable quantity, represented in integer or real
values
 Numeric attributes can be interval-scaled or ratio-scaled.
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of values as being an order of magnitude
larger than the unit of measurement (10 K is twice as
high as 5 K).
 e.g., temperature in Kelvin, length, counts,
monetary quantities
Discrete vs. Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values

 E.g., zip codes, profession, or the set of words in a

collection of documents
 Sometimes, represented as integer variables

 Note: Binary attributes are a special case of discrete


attributes
 Continuous Attribute
 Has real numbers as attribute values

 E.g., temperature, height, or weight

 Practically, real values can only be measured and


represented using a finite number of digits
 Continuous attributes are typically represented as
floating-point variables
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Basic Statistical Descriptions of Data
 For data preprocessing to be successful, it is
essential to have an overall picture of your data.
 Basic statistical descriptions can be used to
identify properties of the data and highlight
which data values should be treated as noise or
outliers.

Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):

   $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$        $\mu = \frac{\sum x}{N}$

  Note: n is sample size and N is population size
  Weighted arithmetic mean:

   $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

  Trimmed mean: chopping extreme values before computing the mean
 Median:
  Middle value if odd number of values, or average of the
middle two values otherwise
  Estimated by interpolation (for grouped data):

   $median = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width$

 Mode:
  Value that occurs most frequently in the data
  Unimodal, bimodal, trimodal
  Empirical formula: $mean - mode \approx 3 \times (mean - median)$
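As a quick sketch (not part of the slides), Python's standard `statistics` module computes these central-tendency measures directly. The data list below is illustrative only:

```python
from statistics import mean, median, mode

# Illustrative data (e.g., salaries in thousands), already sorted.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

avg = mean(data)           # arithmetic mean: 696 / 12 = 58
mid = median(data)         # n is even, so average of 52 and 56 = 54
most_common = mode(data)   # first most frequent value (data is bimodal: 52 and 70)

# Trimmed mean: chop one extreme value at each end before averaging.
trimmed = mean(sorted(data)[1:-1])

print(avg, mid, most_common, trimmed)
```

Note that `mode` returns only one value even for bimodal data; `statistics.multimode` (Python 3.8+) returns all of them.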
Symmetric vs. Skewed Data
 Median, mean, and mode of symmetric, positively skewed, and
negatively skewed data
 [Figure: distribution curves for symmetric, positively skewed,
and negatively skewed data, showing the relative positions of
mean, median, and mode]

March 26, 2019 Data Mining: Concepts and Techniques


Measuring the Dispersion of Data

 Range, Quartiles, and Interquartile Range


 Let x1, x2, ..., xN be a set of observations for some numeric
attribute, X.
 The range of the set is the difference between the

largest (max()) and smallest (min()) values.


 Suppose that the data for attribute X are sorted in
increasing numeric order.
 Imagine that we can pick certain data points so as to split
the data distribution into equal-size consecutive sets, as in
Figure 2.2.
 These data points are called quantiles.

 Quantiles are points taken at regular intervals of a data


distribution
Measuring the Dispersion of Data
 [Figure 2.2: a plot of the data distribution for some attribute
X, with quantiles plotted at regular intervals]
Measuring the Dispersion of Data
 The 2-quantile is the data point dividing the lower and
upper halves of the data distribution.
 It corresponds to the median.

 The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-
fourth of the data distribution.
 They are more commonly referred to as quartiles.

 The 100-quantiles are more commonly referred to as


percentiles; they divide the data distribution into 100
equal-sized consecutive sets.
 The median, quartiles, and percentiles are the most widely
used forms of quantiles.
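A minimal sketch of computing quartiles with Python's `statistics.quantiles` (not part of the slides; the data values are illustrative). `method="inclusive"` treats the minimum and maximum as the 0th and 100th percentiles:

```python
from statistics import quantiles

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# 4-quantiles (quartiles): three cut points splitting the sorted
# data into four equal-sized consecutive sets.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)   # q2 equals the median
```

Passing `n=100` instead would produce the 99 percentile cut points.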

Measuring the Dispersion of Data
 The quartiles give an indication of a distribution’s center, spread,
and shape.
 The first quartile, denoted by Q1, is the 25th percentile.
 It cuts off the lowest 25% of the data.

 The third quartile, denoted by Q3, is the 75th percentile—


 it cuts off the lowest 75% (or highest 25%) of the data.

 The second quartile is the 50th percentile.


 As the median, it gives the center of the data distribution.

 The distance between the first and third quartiles is a simple


measure of spread that gives the range covered by the middle
half of the data.
 This distance is called the interquartile range (IQR) and is
defined as IQR = Q3 -Q1.

Interquartile range (IQR)
 [Figures: illustrations of the interquartile range,
IQR = Q3 - Q1]
Measuring the Dispersion of Data (Five-
Number Summary, Boxplots, and Outliers)
 A common rule of thumb for identifying suspected outliers
is to single out values falling at least 1.5 IQR above the
third quartile or below the first quartile.
 The five-number summary of a distribution consists of
the
 median (Q2),

 the quartiles Q1 and

 Q3, and

 the smallest and

 largest individual observations,

 written in the order of Minimum, Q1, Median, Q3,


Maximum.
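The five-number summary and the 1.5 × IQR outlier rule above can be sketched in a few lines of Python (illustrative, not from the slides; function names are this sketch's own):

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (Minimum, Q1, Median, Q3, Maximum)."""
    s = sorted(values)
    q1, q2, q3 = quantiles(s, n=4, method="inclusive")
    return s[0], q1, q2, q3, s[-1]

def suspected_outliers(values):
    """Values falling more than 1.5 x IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(data))
print(suspected_outliers(data))
```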
Measuring the Dispersion of Data (Five-
Number Summary, Boxplots, and Outliers)
 Boxplots are a popular way of
visualizing a distribution.
 A boxplot incorporates the five-
number summary as follows:
 Typically, the ends of the box are at the quartiles, so
that the box length is the interquartile range.
 The median is marked by a line
within the box.
 Two lines (called whiskers)
outside the box extend to the
smallest (Minimum) and largest
(Maximum) observations.

Measuring the Dispersion of Data (Variance
and Standard Deviation)
 Variance and standard deviation are measures of
data dispersion.
 They indicate how spread out a data distribution is.
 A low standard deviation means that the data
observations tend to be very close to the mean,
 while a high standard deviation indicates that

the data are spread out over a large range of


values.

Measuring the Dispersion of Data (Variance
and Standard Deviation)
 The variance of N observations, x1, x2, ..., xN, for a numeric
attribute X is

   $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$

  where $\bar{x}$ is the mean value of the observations
 The standard deviation, σ, of the observations is the square
root of the variance, σ²
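Python's `statistics` module implements exactly these population measures; a minimal sketch with illustrative data:

```python
from statistics import pvariance, pstdev

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

var = pvariance(data)   # population variance: mean squared deviation from the mean
std = pstdev(data)      # population standard deviation = sqrt(variance)
print(var, std)
```

For a sample rather than the full population, `variance` and `stdev` divide by N − 1 instead of N.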
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are

 Value is higher when objects are more alike

 Often falls in the range [0,1]

 Dissimilarity (e.g., distance)


 Numerical measure of how different two data objects

are
 Lower when objects are more alike

 Minimum dissimilarity is often 0

 Upper limit varies

 Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
 Data matrix
  n data points with p dimensions
  Two modes (rows are objects, columns are attributes); the
data matrix is often called a two-mode matrix

   $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

 Dissimilarity matrix
  n data points, but registers only the distance
  A triangular matrix
  Single mode: the dissimilarity matrix contains one kind of
entity (dissimilarities) and so is called a one-mode matrix

   $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
Proximity Measure for Nominal Attributes

 Can take 2 or more states, e.g., red, yellow, blue,


green (generalization of a binary attribute)
 Method 1: Simple matching
  m: # of matches, p: total # of variables

   $d(i, j) = \frac{p - m}{p}$

 Method 2: Use a large number of binary attributes
  creating a new binary attribute for each of the
nominal states
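The simple-matching formula is a one-liner in Python; a minimal sketch (the function name and attribute values are this sketch's own, purely illustrative):

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p, where m is the
    number of attributes on which the two objects match."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Two hypothetical objects with two nominal attributes each.
print(nominal_dissimilarity(("red", "single"), ("red", "married")))  # 0.5
```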
“how can we compute the dissimilarity
between two binary attributes?”
 One approach involves computing a
dissimilarity matrix from the given binary
data.
 If all binary attributes are thought of as having the same
weight, we have the 2×2 contingency table of Table 2.3, where
 q is the number of attributes that

equal 1 for both objects i and j ,


 r is the number of attributes that equal

1 for object i but equal 0 for object j ,


 s is the number of attributes that equal

0 for object i but equal 1 for object j ,


and
 t is the number of attributes that equal

0 for both objects i and j .


 The total number of attributes is p, where p = q + r + s + t.
“how can we compute the dissimilarity
between two binary attributes?”
 Recall that for symmetric binary attributes,
each state is equally valuable.
 Dissimilarity that is based on symmetric
binary attributes is called symmetric
binary dissimilarity.
 If objects i and j are described by
symmetric binary attributes, then the
dissimilarity between i and j is

   $d(i, j) = \frac{r + s}{q + r + s + t}$
“how can we compute the dissimilarity
between two binary attributes?”
 For asymmetric binary attributes, the two
states are not equally important, such as the
positive (1) and negative (0) outcomes of a
disease test.
 Given two asymmetric binary attributes, the
agreement of two 1s (a positive match) is
then considered more significant than that of
two 0s (a negative match).
 Therefore, such binary attributes are often
considered “monary” (having one state).
 The dissimilarity based on these attributes is
called asymmetric binary dissimilarity, where
the number of negative matches, t, is considered unimportant
and is thus ignored in the following computation:

   $d(i, j) = \frac{r + s}{q + r + s}$
“how can we compute the dissimilarity
between two binary attributes?”
 Complementarily, we can measure the
difference between two binary attributes
based on the notion of similarity instead of
dissimilarity.
 For example, the asymmetric binary similarity
between the objects i and j can be computed as

   $sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)$

 The coefficient sim(i, j) is called the Jaccard coefficient
and is popularly referenced in the literature.
Proximity Measure for Binary Attributes
 A contingency table for binary data:

               Object j
                 1     0    sum
  Object i  1    q     r    q+r
            0    s     t    s+t
          sum   q+s   r+t    p

 Distance measure for symmetric binary variables:
   $d(i, j) = \frac{r + s}{q + r + s + t}$
 Distance measure for asymmetric binary variables:
   $d(i, j) = \frac{r + s}{q + r + s}$
 Jaccard coefficient (similarity measure for asymmetric
binary variables):
   $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
Dissimilarity between Binary Variables
 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N be 0
01
d ( jack , mary )   0.33
2 01
11
d ( jack , jim )   0.67
111
1 2
d ( jim , mary )   0.75
11 2
 These measurements suggest that Jim and Mary are unlikely to have a
similar disease because they have the highest dissimilarity value among
the three pairs. Of the three patients, Jack and Mary are the most likely
to have a similar disease 39
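The three dissimilarities above can be reproduced in Python. This sketch uses only the six asymmetric attributes (fever, cough, test-1..test-4), with Y/P mapped to 1 and N to 0, exactly as the example specifies; the function name is this sketch's own:

```python
def asymmetric_binary_dissimilarity(i, j):
    """d(i, j) = (r + s) / (q + r + s); negative matches (t) are ignored."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # positive matches
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 in i, 0 in j
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 0 in i, 1 in j
    return (r + s) / (q + r + s)

# (fever, cough, test-1, test-2, test-3, test-4), Y/P -> 1, N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75
```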
Dissimilarity of Numeric Data: Minkowski Distance

 Both the Euclidean and the Manhattan distance satisfy the


following mathematical properties:
 Non-negativity: d(i, j)>=0: Distance is a non-negative

number.
 Identity of indiscernibles: d(i, i)= 0: The distance of an

object to itself is 0.
 Symmetry: d(i, j) = d(j, i): Distance is a symmetric
function.
 Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j): Going
directly from object i to object j in space is no more
than making a detour over any other object k.
 A measure that satisfies these conditions is known as a
metric.
Standardizing Numeric Data
 Z-score:

   $z = \frac{x - \mu}{\sigma}$

  x: raw score to be standardized, μ: mean of the population,
σ: standard deviation
  the distance between the raw score and the population mean
in units of the standard deviation
  negative when the raw score is below the mean, “+” when above
 An alternative way: calculate the mean absolute deviation

   $s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$

  where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$
  standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
 Using mean absolute deviation is more robust than using
standard deviation
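A minimal sketch of the mean-absolute-deviation variant in Python (illustrative values; function names are this sketch's own):

```python
def mean_absolute_deviation(values):
    """s_f: average absolute deviation from the mean m_f."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

def z_scores_mad(values):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    m = sum(values) / len(values)
    s = mean_absolute_deviation(values)
    return [(x - m) / s for x in values]

vals = [10, 20, 30, 40]          # hypothetical measurements
print(z_scores_mad(vals))        # [-1.5, -0.5, 0.5, 1.5]
```

Because outliers are not squared, they inflate s_f less than they would the standard deviation, which is why this variant is more robust.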
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
point attribute1 attribute2
x1 1 2
x2 3 5
x3 2 0
x4 4 5

Dissimilarity Matrix
(with Euclidean Distance)

     x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Distance on Numeric Data: Minkowski Distance
 Minkowski distance: A popular distance measure

   $d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
  E.g., the Hamming distance: the number of bits that are
different between two binary vectors

   $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

 h = 2: Euclidean (L2 norm) distance

   $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

 h → ∞: “supremum” (Lmax norm, L∞ norm) distance
  This is the maximum difference between any component
(attribute) of the vectors

   $d(i, j) = \max_{f} |x_{if} - x_{jf}|$
Example: Minkowski Distance
Dissimilarity Matrices
 Data:
   point  attribute 1  attribute 2
   x1        1            2
   x2        3            5
   x3        2            0
   x4        4            5

 Manhattan (L1):
   L1   x1    x2    x3    x4
   x1   0
   x2   5     0
   x3   3     6     0
   x4   6     1     7     0

 Euclidean (L2):
   L2   x1    x2    x3    x4
   x1   0
   x2   3.61  0
   x3   2.24  5.1   0
   x4   4.24  1     5.39  0

 Supremum (L∞):
   L∞   x1    x2    x3    x4
   x1   0
   x2   3     0
   x3   2     5     0
   x4   3     1     5     0
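A short Python sketch reproducing entries of these matrices (function names are this sketch's own):

```python
def minkowski(a, b, h):
    """Minkowski (L-h) distance; h=1 is Manhattan, h=2 is Euclidean."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):
    """L-infinity distance: the maximum per-attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

x1, x2, x3, x4 = (1, 2), (3, 5), (2, 0), (4, 5)

print(minkowski(x1, x2, 1))            # 5 (Manhattan)
print(round(minkowski(x1, x2, 2), 2))  # 3.61 (Euclidean)
print(supremum(x1, x2))                # 3 (supremum)
```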
Ordinal Variables

 An ordinal variable can be discrete or continuous


 Order is important, e.g., rank
 Can be treated like interval-scaled
 replace x_if by its rank r_if ∈ {1, ..., M_f}
 map the range of each variable onto [0, 1] by replacing the
i-th object in the f-th variable by

   $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

 compute the dissimilarity using methods for interval-scaled
variables
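The rank-to-[0, 1] mapping above can be sketched in Python (the function name and state values are this sketch's own, purely illustrative):

```python
def ordinal_to_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    rank = {state: i + 1 for i, state in enumerate(ordered_states)}  # r_if in {1..M_f}
    M = len(ordered_states)
    return [(rank[v] - 1) / (M - 1) for v in values]

z = ordinal_to_interval(["small", "large", "medium"],
                        ["small", "medium", "large"])
print(z)  # [0.0, 1.0, 0.5]
```

After this mapping, any interval-scaled dissimilarity (e.g., Euclidean distance) can be applied to the z values.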
Attributes of Mixed Type

 A database may contain all attribute types


 Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
 One may use a weighted formula to combine their effects:

   $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

  f is binary or nominal:
d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
  f is numeric: use the normalized distance
  f is ordinal:
   Compute ranks r_if and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
   Treat z_if as interval-scaled
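A minimal sketch of the weighted formula for equal weights (delta = 1 unless a value is missing). All function and parameter names, and the example objects, are this sketch's own:

```python
def mixed_dissimilarity(i, j, kinds, ranges, states):
    """Average the per-attribute distances d_ij^(f), skipping
    attributes where either value is missing (delta_ij^(f) = 0)."""
    total, count = 0.0, 0
    for f, (a, b) in enumerate(zip(i, j)):
        if a is None or b is None:          # missing value: delta = 0
            continue
        kind = kinds[f]
        if kind in ("nominal", "binary"):
            d = 0.0 if a == b else 1.0
        elif kind == "numeric":
            d = abs(a - b) / ranges[f]      # normalized distance
        else:                               # "ordinal": values given as ranks r_if
            M = states[f]
            d = abs((a - 1) / (M - 1) - (b - 1) / (M - 1))
        total += d
        count += 1
    return total / count

# One nominal code, one numeric value (range 45), one ordinal rank out of 3 states.
kinds = ["nominal", "numeric", "ordinal"]
obj1 = ("code-A", 45, 3)
obj2 = ("code-B", 22, 1)
print(round(mixed_dissimilarity(obj1, obj2, kinds, {1: 45}, {2: 3}), 2))  # 0.84
```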
Cosine Similarity
 A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.

 Other vector objects: gene features in micro-arrays, …


 Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then

   cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

  where · indicates the vector dot product and ||d|| is the
length of vector d
Example: Cosine Similarity
 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = (42)^0.5 = 6.48
||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.48 × 4.12) = 0.94
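The worked example can be verified with a few lines of Python (the function name is this sketch's own):

```python
from math import sqrt

def cosine(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))  # 0.94
```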
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
 Many types of data sets, e.g., numerical, text, graph, Web, image.
 Gain insight into the data by:
 Basic statistical data description: central tendency, dispersion,
graphical displays
 Data visualization: map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing.
 Many methods have been developed but still an active area of research.

