Concepts and Techniques: - Chapter 2
— Chapter 2 —
Data Visualization
Summary
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector
[Figure: document-term matrix. Terms: team, coach, play, ball, score, game, win, lost, timeout, season]
Document 1: 3 0 5 0 2 6 0 2 0 2
Document 2: 0 7 0 2 1 0 0 3 0 0
Transaction data
Graph and network
Binary
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal (categorical): categories, states, or “names of
things”
Hair_color = {black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
enumerations
Because nominal attribute values do not have any
meaningful order about them and are not quantitative, it
makes no sense to find the mean (average) value or
median (middle) value for such an attribute, given a set of
objects.
One thing that is of interest, however, is the attribute’s
most commonly occurring value. This value is known as the mode.
Attribute Types
Binary
Nominal attribute with only 2 states (0 and 1)
0 typically means that the attribute is absent, and 1 means that it
is present.
Binary attributes are referred to as Boolean if the two states correspond to true and false.
For a smoker attribute, 1 indicates that the patient smokes, while 0 indicates that the patient does not.
Suppose a patient undergoes a medical test that has two possible outcomes: 1 means the result of the test for the patient is positive, while 0 means the result is negative.
Attribute Types
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking), but the magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
The values have a meaningful sequence (which corresponds to increasing size).
Ordinal Attribute
Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured
objectively;
thus ordinal attributes are often used in surveys for
ratings.
In one survey, participants were asked to rate how
satisfied they were as customers.
Customer satisfaction had the following ordinal
categories: 0: very dissatisfied, 1: somewhat dissatisfied,
2: neutral, 3: satisfied, and 4: very satisfied.
The central tendency of an ordinal attribute can be
represented by its mode and its median (the middle value
in an ordered sequence), but the mean cannot be
defined.
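As a rough illustration of the point above, the following Python sketch computes the mode and the median of ordinal survey codes. The response list is invented for illustration and the variable names are assumptions, not part of the survey described in the slides.

```python
from statistics import median_low, mode

# Ordinal codes from the survey example: 0 = very dissatisfied ... 4 = very satisfied.
LABELS = {
    0: "very dissatisfied",
    1: "somewhat dissatisfied",
    2: "neutral",
    3: "satisfied",
    4: "very satisfied",
}

# Hypothetical responses, invented for illustration only.
responses = [3, 4, 2, 3, 1, 3, 0, 4, 2]

most_common = mode(responses)    # most frequently chosen category
middle = median_low(responses)   # middle value of the ordered responses (always an observed category)

print("mode:", LABELS[most_common])
print("median:", LABELS[middle])
```

median_low is used here so that the result is always one of the observed categories, even for an even number of responses; averaging two category codes would implicitly treat them as quantities.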
Ordinal Attribute
Nominal, binary, and ordinal attributes are qualitative:
they describe a feature of an object without
giving an actual size or quantity.
The values of such qualitative attributes are
typically words representing categories.
If integers are used, they represent computer
codes for the categories, as opposed to
measurable quantities
(e.g., 0 for small drink size, 1 for medium, and
2 for large)
Numeric Attribute Types
A numeric attribute is quantitative
it is a measurable quantity, represented in integer or real
values
Numeric attributes can be interval-scaled or ratio-scaled.
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of one value as being a multiple (or ratio) of another
(10 K is twice as high as 5 K).
E.g., temperature in Kelvin, length, counts,
monetary quantities
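The difference between interval and ratio scales can be checked numerically. The sketch below is only an illustration with made-up temperatures: differences survive the change from Celsius to Kelvin, but the ratio is meaningful only on the Kelvin scale, which has an inherent zero-point.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius temperature to Kelvin (zero at absolute zero)."""
    return c + 273.15

# Two illustrative temperatures in degrees Celsius (invented values).
a_c, b_c = 10.0, 5.0

# Differences are meaningful on an interval scale and are unchanged in Kelvin.
diff_c = a_c - b_c
diff_k = celsius_to_kelvin(a_c) - celsius_to_kelvin(b_c)

# Ratios: 10/5 = 2 in Celsius is an artifact of where zero was placed.
ratio_c = a_c / b_c
ratio_k = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)

print(diff_c, round(diff_k, 6))   # same difference on both scales
print(ratio_c, round(ratio_k, 3)) # 2.0 versus roughly 1.018
```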
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., the set of words in a collection of documents
Sometimes, represented as integer variables
Basic Statistical Descriptions of Data
For data preprocessing to be successful, it is
essential to have an overall picture of your data.
Basic statistical descriptions can be used to
identify properties of the data and highlight
which data values should be treated as noise or
outliers.
Measuring the Central Tendency
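Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median: the middle value of the ordered data, or the average of the two middle values when the number of values is even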
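The measures of central tendency listed above can be sketched in Python as follows. The data values, the helper names weighted_mean and trimmed_mean, and the 10% trimming fraction are assumptions chosen for illustration.

```python
from statistics import mean, median

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` (< 0.5) of the sorted values off each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)      # number of values dropped per end
    return mean(ordered[k:len(ordered) - k])

# Illustrative data (invented): one large value pulls the mean upward.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(values))
print("median:", median(values))
print("10% trimmed mean:", trimmed_mean(values, 0.10))
print("weighted mean (equal weights):", weighted_mean(values, [1] * len(values)))
```

With equal weights the weighted mean reduces to the ordinary mean; the trimmed mean here drops the smallest and the largest value before averaging.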
Measuring the Dispersion of Data
The 2-quantile is the data point dividing the lower and
upper halves of the data distribution.
It corresponds to the median.
The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-
fourth of the data distribution.
They are more commonly referred to as quartiles.
Measuring the Dispersion of Data
The quartiles give an indication of a distribution’s center, spread,
and shape.
The first quartile, denoted by Q1, is the 25th percentile.
It cuts off the lowest 25% of the data.
Interquartile range (IQR)
Measuring the Dispersion of Data (Five-
Number Summary, Boxplots, and Outliers)
A common rule of thumb for identifying suspected outliers
is to single out values falling at least 1.5 × IQR above the
third quartile or below the first quartile.
The five-number summary of a distribution consists of
the minimum, Q1, the median (Q2), Q3, and the maximum.
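A minimal sketch of the five-number summary and the 1.5 × IQR rule, where IQR = Q3 - Q1. The data are invented, and the quartiles come from Python's statistics.quantiles with its default "exclusive" method; other quartile conventions give slightly different values.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)   # default method="exclusive"
    return min(values), q1, q2, q3, max(values)

def suspected_outliers(values):
    """Values at least 1.5 * IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x <= low or x >= high]

# Illustrative data (invented).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = five_number_summary(values)
outliers = suspected_outliers(values)
print("five-number summary:", summary)
print("suspected outliers:", outliers)
```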
Measuring the Dispersion of Data (Variance
and Standard Deviation)
Variance and standard deviation are measures of
data dispersion.
They indicate how spread out a data distribution is.
A low standard deviation means that the data
observations tend to be very close to the mean,
while a high standard deviation indicates that
the data are spread out over a large range of values.
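The variance of N observations, $x_1, x_2, \dots, x_N$,
for a numeric attribute X is
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2$
where $\bar{x}$ is the mean of the observations. The standard deviation $\sigma$ is the square root of the variance.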
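The population variance (the mean squared deviation from the mean) can be checked numerically. The sketch below uses invented observations and two hypothetical helper functions, and compares them with pvariance and pstdev from Python's statistics module.

```python
from statistics import pstdev, pvariance

def variance_by_definition(xs):
    """Mean of squared deviations from the mean."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def variance_shortcut(xs):
    """Mean of squares minus the squared mean (algebraically equivalent)."""
    mu = sum(xs) / len(xs)
    return sum(x * x for x in xs) / len(xs) - mu ** 2

# Illustrative observations (invented).
observations = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance_by_definition(observations))  # population variance
print(variance_shortcut(observations))       # same value via the shortcut form
print(pvariance(observations), pstdev(observations))
```

The shortcut form needs only one pass over the data, but it can lose precision when the mean is large relative to the spread.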
Chapter 2: Getting to Know Your Data
Summary
Similarity and Dissimilarity
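Similarity
Numerical measure of how alike two data objects are
Higher when objects are more alike
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Lower when objects are more alike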
Data Matrix and Dissimilarity Matrix
Data matrix
n data points with p dimensions (attributes)
How is dissimilarity computed between objects
described by nominal attributes?
Proximity Measure for Nominal Attributes
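One common proximity measure for nominal attributes is the ratio of mismatches, d(i, j) = (p - m) / p, where m is the number of attributes on which objects i and j match and p is the total number of attributes. A minimal Python sketch, with invented objects and a hypothetical function name:

```python
def nominal_dissimilarity(a, b):
    """Fraction of nominal attributes on which two objects differ: (p - m) / p."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)                                   # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matches
    return (p - m) / p

# Hypothetical objects described by (hair colour, marital status, occupation).
obj1 = ("black", "single", "engineer")
obj2 = ("brown", "single", "engineer")

d = nominal_dissimilarity(obj1, obj2)
print(round(d, 2))   # one mismatch out of three attributes
```

The corresponding similarity is 1 - d(i, j) = m / p.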
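“How can we compute the dissimilarity
between two binary attributes?”
One approach involves computing a
dissimilarity matrix from the given binary
data.
If all binary attributes are thought of as
having the same weight, we have the 2 × 2
contingency table of Table 2.3, where
q is the number of attributes that equal 1 for both objects i and j,
r is the number of attributes that equal 1 for object i but 0 for object j,
s is the number of attributes that equal 0 for object i but 1 for object j, and
t is the number of attributes that equal 0 for both objects i and j.
The total number of attributes is p, where p = q + r + s + t.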
Recall that for symmetric binary attributes,
each state is equally valuable.
Dissimilarity that is based on symmetric
binary attributes is called symmetric
binary dissimilarity.
If objects i and j are described by
symmetric binary attributes, then the
dissimilarity between i and j is
$d(i, j) = \frac{r + s}{q + r + s + t}$
For asymmetric binary attributes, the two
states are not equally important, such as the
positive (1) and negative (0) outcomes of a
disease test.
Given two asymmetric binary attributes, the
agreement of two 1s (a positive match) is
then considered more significant than that of
two 0s (a negative match).
Therefore, such binary attributes are often
considered “monary” (having one state).
The dissimilarity based on these attributes is
called asymmetric binary dissimilarity, where
the number of negative matches, t, is
considered unimportant and is thus ignored
in the following computation:
$d(i, j) = \frac{r + s}{q + r + s}$
Complementarily, we can measure the
difference between two binary attributes
based on the notion of similarity instead of
dissimilarity.
For example, the asymmetric binary similarity
between the objects i and j can be computed
as
$sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)$
Proximity Measure for Binary Attributes
A contingency table for binary data
                Object j
                1        0        sum
Object i   1    q        r        q + r
           0    s        t        s + t
         sum    q + s    r + t    p
Dissimilarity between Binary Variables
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
Gender is a symmetric attribute
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0
$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
These measurements suggest that Jim and Mary are unlikely to have a
similar disease because they have the highest dissimilarity value among
the three pairs. Of the three patients, Jack and Mary are the most likely
to have a similar disease.
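The three values in this example can be reproduced with a short Python sketch. It encodes Y and P as 1 and N as 0 over the six asymmetric attributes (Fever, Cough, Test-1 to Test-4), leaving Gender out as the example does; the function names are assumptions for illustration.

```python
def contingency_counts(i, j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1 in both
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 in i only
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 1 in j only
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # 0 in both
    return q, r, s, t

def asymmetric_binary_dissimilarity(i, j):
    """(r + s) / (q + r + s): negative matches t are ignored.

    Undefined (division by zero) if both objects are all zeros.
    """
    q, r, s, _ = contingency_counts(i, j)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 from the table above.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))
```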
Dissimilarity of Numeric Data: Minkowski Distance
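Non-negativity: d(i, j) ≥ 0: Distance is a non-negative number.
Identity of indiscernibles: d(i, i) = 0: The distance of an
object to itself is 0.
Symmetry: d(i, j) = d(j, i): Distance is a symmetric
function.
Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j): Going directly from object i to object j
is no longer than making a detour over any other object k.
A measure that satisfies these conditions is known as a metric.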
Standardizing Numeric Data
Z-score: $z = \frac{x - \mu}{\sigma}$
x: raw score to be standardized, μ: mean of the population, σ:
standard deviation
the distance between the raw score and the population mean in
units of the standard deviation
negative when the raw score is below the mean, positive when above
An alternative way: Calculate the mean absolute deviation
$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$
standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
Using mean absolute deviation is more robust than using standard
deviation
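Both ways of standardizing a numeric attribute can be sketched in a few lines of Python. The attribute values below are illustrative, and the function names are assumptions; the first function divides by the population standard deviation, the second by the mean absolute deviation.

```python
from statistics import pstdev

def z_scores(values):
    """Standardize with the mean and the population standard deviation."""
    mu = sum(values) / len(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

def z_scores_mad(values):
    """Standardize with the mean m_f and the mean absolute deviation s_f."""
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n
    return [(x - m) / s for x in values]

# Illustrative values of a single numeric attribute.
attribute = [1, 3, 2, 4]

print([round(z, 3) for z in z_scores(attribute)])
print(z_scores_mad(attribute))
```

Both functions fail with a division by zero when all values are identical, since the spread is then 0.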
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
Dissimilarity Matrix (with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h}$
where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
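h = 1: Manhattan (city block, L1 norm) distance
E.g., the Hamming distance: the number of bits that are different between two binary vectors
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
h = 2: Euclidean (L2 norm) distance
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$
h → ∞: supremum (L∞ norm) distance, the maximum difference over all attributes
$d(i, j) = \max_{f} |x_{if} - x_{jf}|$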
Example: Minkowski Distance
Data Matrix
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
Dissimilarity Matrices
Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0
Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Supremum (L∞)
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
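The Manhattan, Euclidean, and supremum entries for the four points in this example can be recomputed with a small Python sketch. The function name minkowski and the use of math.inf to select the supremum case are choices made here for illustration.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if h == math.inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

# The four points from the data matrix in this example.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

# Print the lower-triangular entries for each of the three orders.
for h, label in [(1, "Manhattan"), (2, "Euclidean"), (math.inf, "Supremum")]:
    print(label)
    names = list(points)
    for row, name in enumerate(names):
        cells = [round(minkowski(points[name], points[other], h), 2)
                 for other in names[:row + 1]]
        print(name, cells)
```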
Ordinal Variables
Attributes of Mixed Type
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Example: Cosine Similarity
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
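The cosine similarity of the two term-frequency vectors in this example can be verified with a few lines of Python. This is a plain-Python sketch; the function name is an assumption, and a numerical library would normally be used for large document collections.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)   # undefined if either vector is all zeros

# Term-frequency vectors from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```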
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
Many types of data sets, e.g., numerical, text, graph, Web, image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion,
graphical displays
Data visualization: map data onto graphical primitives
Measure data similarity
The above steps are the beginning of data preprocessing.
Many methods have been developed, but this is still an active area of research.