Understanding Data
Getting to Know Your Data
It’s tempting to jump straight into mining, but first, we need
to get the data ready.
Knowledge about your data is useful for data preprocessing.
What are the types of attributes or fields that make up your
data?
What kind of values does each attribute have? Which
attributes are discrete, and which are continuous-valued?
What do the data look like? How are the values distributed?
Are there ways we can visualize the data to get a better
sense of it all? Can we spot any outliers?
Can we measure the similarity of some data objects with
respect to others?
Gaining such insight into the data will help with the subsequent
analysis.
Types of Data Sets

Record
  Relational records
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents represented as term-frequency vectors
  Transaction data
Graph and network
  World Wide Web
  Social or information networks
  Molecular structures
Ordered
  Video data: sequence of images
  Temporal data: time-series
  Sequential data: transaction sequences
  Genetic sequence data
Spatial, image and multimedia
  Spatial data: maps
  Image data
  Video data

Example term-frequency vectors (one row per document):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0     2      0       2
Document 2    0     7      0     2     1      0     0     3      0       0
Document 3    0     1      0     0     1      2     2     0      3       0

Example transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples, examples, instances, data points, objects, or tuples.
Data objects are described by attributes.
Database rows correspond to data objects; columns correspond to attributes.
Attributes
Attribute (or dimension, feature, variable): a data
field, representing a characteristic or feature of a
data object.
  E.g., customer_ID, name, address
Types:
  Nominal
  Binary
  Ordinal
  Numeric: quantitative
    Interval-scaled
    Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
A nominal scale, as the name implies, is simply some placing of
data into categories, without any order or structure.
Hair_color = {black, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of values as being an order of
magnitude larger than the unit of measurement
(10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts
Discrete vs. Continuous Attributes

Discrete attribute
  Has only a finite or countably infinite set of values
  E.g., zip codes, or the set of words in a collection of documents
  Sometimes represented as integer variables
  Binary attributes are a special case of discrete attributes
Continuous attribute
  Has real numbers as attribute values
  E.g., temperature, height, weight
  Typically represented as floating-point variables
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of
precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
  Sample mean: x̄ = (1/n) Σ_{i=1..n} x_i        Population mean: μ = (Σ x_i) / N
  Note: n is the sample size and N is the population size.
  Weighted arithmetic mean: x̄ = (Σ w_i x_i) / (Σ w_i)
  Trimmed mean: the mean obtained after chopping off extreme values
Median: middle value of the sorted data (or average of the two middle values)
  Estimated by interpolation for grouped data:
    median = L1 + ((n/2 - (Σ freq)_l) / freq_median) × width
Mode
  Value that occurs most frequently in the data
  Unimodal, bimodal, trimodal
  Empirical formula: mean - mode ≈ 3 × (mean - median)
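The measures above can be sketched with Python's statistics module; the salary-style data values below are invented for illustration.

```python
import statistics

# Illustrative data (e.g., salaries in $1000s); values are made up for this sketch.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)            # (1/n) * sum(x_i)
median = statistics.median(data)        # average of the two middle values here
modes = statistics.multimode(data)      # most frequent values; bimodal here

# Trimmed mean: chop off the k smallest and k largest values, then average.
def trimmed_mean(values, k):
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

print(mean, median, modes, trimmed_mean(data, 1))
```

Note that this data set is bimodal (52 and 70 both occur twice), which is why `multimode` rather than `mode` is used.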
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively
and negatively skewed data
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quantiles: points taken at regular intervals of a data distribution,
dividing it into essentially equal-size consecutive sets.
Quartiles: The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-fourth of the
data distribution. They are more commonly referred to as quartiles: Q1
(25th percentile), Q2 (the median), and Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, median, Q3, max
Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
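A small sketch of the five-number summary, the IQR, and the 1.5 × IQR outlier rule; the data values are invented for illustration.

```python
import statistics

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 120]

# Quartiles via statistics.quantiles; "inclusive" treats the data as the
# whole population rather than a sample.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(data), q1, q2, q3, max(data))

# Outlier rule: more than 1.5 * IQR below Q1 or above Q3.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(five_number_summary, iqr, outliers)
```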
Boxplot Analysis
Boxplot for the unit price data for items sold at four
branches of AllElectronics during a given time
period.
Measuring the Dispersion of Data
Variance and standard deviation (sample: s, population: σ) are measures of
data dispersion: they indicate how spread out a data distribution is.
A low standard deviation means that the data observations tend to be very close to the
mean, while a high standard deviation indicates that the data are spread out over a large
range of values.
Variance (algebraic, scalable computation):
  Population:  σ² = (1/N) Σ_{i=1..N} (x_i - μ)²  =  (1/N) Σ x_i² - μ²
  Sample:      s² = (1/(n-1)) Σ_{i=1..n} (x_i - x̄)²  =  (1/(n-1)) [ Σ x_i² - (1/n)(Σ x_i)² ]
Standard deviation s (or σ) is the square root of variance s² (or σ²)
The basic properties of the standard deviation, σ, are as follows:
σ measures spread about the mean and should be considered only when the mean is
chosen as the measure of center.
σ = 0 only when there is no spread, that is, when all observations have the same value.
Otherwise, σ > 0.
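The population and sample formulas above, via the statistics module; the four data values are illustrative.

```python
import statistics

data = [4, 8, 6, 2]
mu = statistics.mean(data)               # 5

pop_var = statistics.pvariance(data, mu) # sigma^2 = (1/N) * sum((x_i - mu)^2)
samp_var = statistics.variance(data)     # s^2 = (1/(n-1)) * sum((x_i - xbar)^2)
pop_std = statistics.pstdev(data)        # sigma = sqrt(sigma^2)
print(pop_var, samp_var, pop_std)
```

The only difference between `pvariance` and `variance` is the denominator (N versus n - 1), matching the two formulas above.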
Histograms
“Histos” means pole or mast, and “gram” means
chart, so a histogram is a chart of poles.
If X is nominal, such as automobile model or item type,
then a pole or vertical bar is drawn for each known
value of X. The height of the bar indicates the
frequency (i.e., count) of that X value. The resulting
graph is more commonly known as a bar chart.
If X is numeric, the term histogram is preferred. The
range of values for X is partitioned into disjoint
consecutive subranges. The subranges, referred to as
buckets or bins, are disjoint subsets of the data
distribution for X. The range of a bucket is known as
the width. Typically, the buckets are of equal width.
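Equal-width bucketing, as described above, takes only a few lines; the data and the choice of three buckets are illustrative.

```python
from collections import Counter

data = [1, 3, 7, 8, 12, 14, 15, 18, 22, 25, 28, 29]
num_buckets = 3
lo, hi = min(data), max(data)
width = (hi - lo) / num_buckets          # equal-width buckets

def bucket_of(x):
    # Map x to its bucket index, clamping the maximum into the last bucket.
    return min(int((x - lo) / width), num_buckets - 1)

counts = Counter(bucket_of(x) for x in data)
for b in range(num_buckets):
    left = lo + b * width
    print(f"[{left:.2f}, {left + width:.2f}): {counts[b]}")
```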
Scatter Plots
A scatter plot is one of the most effective
graphical methods for determining if there
appears to be a relationship, pattern, or trend
between two numeric attributes.
Provides a first look at bivariate data to see
clusters of points, outliers, etc.
Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Two attributes, X and Y, are correlated if one
attribute implies the other. Correlations can be
positive, negative, or null (uncorrelated).
Three cases where there is no observed correlation
between the two plotted attributes in each of the
data sets.
Similarity and Dissimilarity
Similarity
  Numerical measure of how alike two data objects are
  Often falls in the range [0, 1]
  Higher when objects are more alike
Dissimilarity (e.g., distance)
  Numerical measure of how different two data objects are
  Lower when objects are more alike
Data Matrix
n data points with p attributes (rows = objects, columns = attributes):

    | x_11  ...  x_1f  ...  x_1p |
    | ...   ...  ...   ...  ...  |
    | x_i1  ...  x_if  ...  x_ip |
    | ...   ...  ...   ...  ...  |
    | x_n1  ...  x_nf  ...  x_np |
Dissimilarity Matrix
An n-by-n table storing d(i, j), the dissimilarity between objects i and j (lower triangle):

    |   0                              |
    | d(2,1)    0                      |
    | d(3,1)  d(3,2)    0              |
    |   :       :       :              |
    | d(n,1)  d(n,2)   ...   ...   0   |
Data and Dissimilarity Matrix
Measures of similarity can often be expressed as
a function of measures of dissimilarity.
For example, sim(i, j) = 1 - d(i, j), where sim(i, j) is the
similarity between objects i and j.
A data matrix is made up of two entities or
“things,” namely rows (for objects) and columns
(for attributes). Therefore, the data matrix is
often called a two-mode matrix.
The dissimilarity matrix contains one kind of
entity (dissimilarities) and so is called a one-
mode matrix.
Proximity Measures for Nominal Attributes
Method 1: simple matching
  m: number of attributes on which i and j match; p: total number of attributes

    d(i, j) = (p - m) / p

Example: suppose that we have sample data where test-1 is nominal.
Since we have one nominal attribute, test-1, we set p = 1, so that
d(i, j) evaluates to 0 if objects i and j match and to 1 if they differ,
and compute the dissimilarity matrix accordingly.
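A sketch of the simple-matching dissimilarity d(i, j) = (p - m)/p; the single nominal attribute and its values (codes A, B, C) are illustrative stand-ins for test-1.

```python
def nominal_dissimilarity(obj_i, obj_j):
    p = len(obj_i)                                 # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matching attributes
    return (p - m) / p

# Four objects described by one nominal attribute (p = 1):
objects = [("A",), ("B",), ("C",), ("A",)]

# Lower triangle of the dissimilarity matrix:
matrix = [[nominal_dissimilarity(objects[i], objects[j]) for j in range(i + 1)]
          for i in range(len(objects))]
print(matrix)
```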
Proximity Measures for Nominal Attributes
Method 2: use a large number of binary variables
Creating a new asymmetric binary variable for
each of the nominal states
For an object with a given state value, the binary
variable representing that state is set to 1, while
the remaining binary variables are set to 0.
For example, to encode the nominal attribute map_color with five
states, a binary attribute can be created for each of the five colors.
For an object having the color yellow, the yellow attribute is set to 1,
while the remaining four attributes are set to 0.
Proximity Measure for Binary Attributes
A binary variable has only two states: 0 or 1, where 0
means that the variable is absent, and 1 means that it is
present.
Given the variable smoker describing a patient,
1 indicates that the patient smokes
0 indicates that the patient does not.
Treating binary variables as if they are interval-scaled can
lead to misleading clustering results.
Therefore, methods specific to binary data are necessary
for computing dissimilarities.
If all binary variables are thought of as having the same
weight, we have the 2-by-2 contingency table
Proximity Measure for Binary Attributes
              Object j
                1     0    sum
Object i  1     q     r    q+r
          0     s     t    s+t
        sum    q+s   r+t    p

where
  q is the number of attributes that equal 1 for both objects i and j,
  r is the number of attributes that equal 1 for object i but 0 for object j,
  s is the number of attributes that equal 0 for object i but 1 for object j, and
  t is the number of attributes that equal 0 for both objects i and j.
  p is the total number of attributes: p = q + r + s + t.
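Counting q, r, s, and t from two binary vectors, matching the contingency table above; the example vectors are illustrative.

```python
def binary_counts(obj_i, obj_j):
    q = sum(a == 1 and b == 1 for a, b in zip(obj_i, obj_j))  # both 1
    r = sum(a == 1 and b == 0 for a, b in zip(obj_i, obj_j))  # i only
    s = sum(a == 0 and b == 1 for a, b in zip(obj_i, obj_j))  # j only
    t = sum(a == 0 and b == 0 for a, b in zip(obj_i, obj_j))  # both 0
    return q, r, s, t

obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
q, r, s, t = binary_counts(obj_i, obj_j)
print(q, r, s, t)   # p = q + r + s + t equals the number of attributes
```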
Proximity Measure for Binary Attributes
A binary variable is symmetric if both of its states are
equally valuable and carry the same weight
Example: the attribute gender having the states male and
female.
Dissimilarity that is based on symmetric binary
variables is called symmetric binary dissimilarity.
The dissimilarity (distance measure) between objects i and j:

    d(i, j) = (r + s) / (q + r + s + t)
Proximity Measure for Binary Attributes
A binary variable is asymmetric if the outcomes of the
states are not equally important,
Example: the positive and negative outcomes of a HIV test.
we shall code the most important outcome, which is usually the
rarest one, by 1 (HIV positive)
Given two asymmetric binary variables, the agreement of
two 1s (a positive match) is then considered more
significant than that of two 0s (a negative match).
Therefore, such binary attributes are often considered
"unary" (as if having one state).
The dissimilarity based on such attributes is called
asymmetric binary dissimilarity:

    d(i, j) = (r + s) / (q + r + s)
Proximity Measure for Binary Attributes
The asymmetric binary similarity between the
objects i and j, or sim(i, j), can be computed as

    sim(i, j) = q / (q + r + s) = 1 - d(i, j)

This similarity is also known as the Jaccard coefficient.
Proximity Measure for Binary Attributes
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
Proximity Measure for Binary Attributes
Let the values Y and P be 1, and the value N be 0.
Suppose that the distance between objects
(patients) is computed based only on the
asymmetric attributes.
The distance between each pair of the three
patients, Jack, Mary, and Jim, is:

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
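The three distances above can be reproduced directly from the q, r, s counts; Y and P are coded as 1, N as 0, and gender (a symmetric attribute) is left out.

```python
def asymmetric_binary_distance(obj_i, obj_j):
    q = sum(a == 1 and b == 1 for a, b in zip(obj_i, obj_j))
    r = sum(a == 1 and b == 0 for a, b in zip(obj_i, obj_j))
    s = sum(a == 0 and b == 1 for a, b in zip(obj_i, obj_j))
    return (r + s) / (q + r + s)       # negative matches (t) are ignored

#        fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```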
Proximity Measure for Numeric Attributes
Manhattan (city block) distance:

    d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

Euclidean distance:

    d(i, j) = sqrt( |x_i1 - x_j1|² + |x_i2 - x_j2|² + ... + |x_ip - x_jp|² )
Proximity Measure for Numeric Attributes
Data Matrix

point  attribute1  attribute2
x1         1           2
x2         3           5
x3         2           0
x4         4           5

Dissimilarity Matrix (with Euclidean distance)

        x1     x2     x3    x4
x1      0
x2     3.61    0
x3     2.24   5.10    0
x4     4.24   1.00   5.39   0
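The dissimilarity matrix above can be recomputed directly from the four points:

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

# Pairwise Euclidean distances, rounded to two decimals.
dist = {(a, b): round(euclidean(pa, pb), 2)
        for a, pa in points.items() for b, pb in points.items()}
print(dist[("x2", "x1")], dist[("x3", "x1")], dist[("x3", "x2")],
      dist[("x4", "x1")], dist[("x4", "x2")], dist[("x4", "x3")])
```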
Proximity Measure for Numeric Attributes
Minkowski distance: a generalization of
the Euclidean and Manhattan distances. It is
defined as:

    d(i, j) = ( |x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ip - x_jp|^h )^(1/h)

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L_h norm).
h is a positive integer.
It reduces to the Manhattan distance when h = 1 and
the Euclidean distance when h = 2.
Proximity Measure for Numeric Attributes
h = 1: Manhattan (city block, L1 norm) distance
  E.g., the Hamming distance: the number of bits that
  differ between two binary vectors

    d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

h = 2: Euclidean (L2 norm) distance

    d(i, j) = sqrt( |x_i1 - x_j1|² + |x_i2 - x_j2|² + ... + |x_ip - x_jp|² )

h → ∞: "supremum" (L_max norm, L_∞ norm) distance
  This is the maximum difference between any component
  (attribute) of the vectors:

    d(i, j) = max_f |x_if - x_jf|
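The whole Minkowski family fits in one function, with the supremum handled as a limit case:

```python
import math

def minkowski(obj_i, obj_j, h):
    # h = 1: Manhattan; h = 2: Euclidean; h = math.inf: supremum distance.
    if h == math.inf:
        return max(abs(a - b) for a, b in zip(obj_i, obj_j))
    return sum(abs(a - b) ** h for a, b in zip(obj_i, obj_j)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))         # 5.0 (Manhattan)
print(minkowski(x1, x2, 2))         # ~3.61 (Euclidean)
print(minkowski(x1, x2, math.inf))  # 3 (supremum)
```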
Proximity Measure for Numeric Attributes
Example: Let x1 = (1, 2) and x2 = (3, 5) represent
two objects
Proximity Measures for Ordinal Attributes
A discrete ordinal variable resembles a categorical
variable, except that the M states of the ordinal value are
ordered in a meaningful sequence.
Example: professional ranks are often enumerated in a
sequential order, such as assistant, associate, and full for
professors.
Ordinal variables may also be obtained from the
discretization of interval-scaled quantities by splitting the
value range into a finite number of classes.
The values of an ordinal variable can be mapped to
ranks.
Example: suppose that an ordinal variable f has Mf states.
These ordered states define the ranking 1, … , Mf .
Proximity Measures for Ordinal Attributes
Suppose that f is a variable from a set of ordinal
variables describing n objects.
The dissimilarity computation with respect to f
involves the following steps:
Step 1:
The value of f for the ith object is xif , and f has Mf
ordered states, representing the ranking 1, … , Mf.
Replace each x_if by its corresponding rank, r_if ∈ {1, ..., M_f}.
Proximity Measures for Ordinal Attributes
Step 2:
Since each ordinal variable can have a different
number of states, it is often necessary to map the
range of each variable onto [0.0, 1.0] so that each
variable has equal weight.
This can be achieved by replacing the rank r_if of the ith object in
the fth variable by

    z_if = (r_if - 1) / (M_f - 1)

Step 3:
Dissimilarity can then be computed using any of the
distance measures described for interval-scaled
variables.
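The three steps above, sketched for an ordinal attribute with the states fair < good < excellent; the states and sample values are illustrative.

```python
states = ["fair", "good", "excellent"]            # ordered states, M_f = 3
rank = {s: i + 1 for i, s in enumerate(states)}   # Step 1: ranks 1 .. M_f
M_f = len(states)

def z(value):
    # Step 2: normalize the rank onto [0.0, 1.0].
    return (rank[value] - 1) / (M_f - 1)

values = ["excellent", "fair", "good", "excellent"]
normalized = [z(v) for v in values]
print(normalized)                                 # [1.0, 0.0, 0.5, 1.0]

# Step 3: any numeric distance on the z values, e.g. Manhattan:
d_12 = abs(normalized[0] - normalized[1])         # 1.0
```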
Proximity Measures for Ordinal Attributes
Example: Suppose that we have the sample
data:
Proximity Measures for Ordinal Attributes
Step 1: if we replace each value for test-2 by its
rank, the four objects are assigned the ranks 3,
1, 2, and 3, respectively.
Step 2: normalizes the ranking by mapping rank
1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
Step 3: we can use, say, the Euclidean distance,
which results in the following dissimilarity matrix:
Proximity Measures for Mixed Type Attributes
A database may contain all attribute types
Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
One may use a weighted formula to combine their effects
    d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f)  /  Σ_{f=1..p} δ_ij^(f)

where the indicator δ_ij^(f) = 0 if x_if or x_jf is missing, and 1 otherwise.
f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
f is numeric: use the normalized distance
f is ordinal:
  Compute ranks r_if and z_if = (r_if - 1) / (M_f - 1)
  Treat z_if as interval-scaled
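A sketch of the weighted mixed-type formula: each attribute f contributes a dissimilarity and an indicator δ (0 when a value is missing). The attribute types and object values below are illustrative.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds):
    # Sum of delta * d over attributes, divided by the sum of delta.
    num = den = 0.0
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:
            continue                   # delta_ij_f = 0: skip missing values
        if kind == "nominal":
            d = 0.0 if a == b else 1.0
        else:                          # numeric, or ordinal mapped to z
            d = abs(a - b)             # assumes values normalized to [0, 1]
        num += d
        den += 1.0
    return num / den

kinds = ["nominal", "numeric", "ordinal"]
obj_i = ["A", 0.45, 1.0]               # ordinal value already mapped to z
obj_j = ["A", 0.40, 0.5]
print(round(mixed_dissimilarity(obj_i, obj_j, kinds), 3))
```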
Proximity Measures for Mixed Type Attributes
For test-1 (which is nominal), the dissimilarity is computed as
outlined above.
For test-2 (which is ordinal), the dissimilarity is computed as
outlined above.
We can now calculate the dissimilarity matrices
for the two variables.
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Cosine Similarity
cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||),
where · indicates the vector dot product and ||d|| is the length (norm) of vector d.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.48
||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 = 4.12
cos(d1, d2) = 25 / (6.48 × 4.12) = 0.94
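The computation above, step by step:

```python
import math

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

dot = sum(a * b for a, b in zip(d1, d2))       # 25
norm1 = math.sqrt(sum(a * a for a in d1))      # sqrt(42) ~ 6.48
norm2 = math.sqrt(sum(b * b for b in d2))      # sqrt(17) ~ 4.12
cos = dot / (norm1 * norm2)
print(round(cos, 2))                           # 0.94
```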
Thanks