Understanding Data
Getting to Know Your Data
It’s tempting to jump straight into mining, but first, we need
to get the data ready.
Knowledge about your data is useful for data preprocessing.
What are the types of attributes or fields that make up your
data?
What kind of values does each attribute have? Which
attributes are discrete, and which are continuous-valued?
What do the data look like? How are the values distributed?
Are there ways we can visualize the data to get a better
sense of it all? Can we spot any outliers?
Can we measure the similarity of some data objects with
respect to others?
Gaining such insight into the data will help with the subsequent
analysis.
Types of Data Sets

Record
  Relational records
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents represented as term-frequency vectors
  Transaction data
Graph and network
  World Wide Web
  Social or information networks
  Molecular structures
Ordered
  Video data: sequence of images
  Temporal data: time-series
  Sequential data: transaction sequences
  Genetic sequence data
Spatial, image and multimedia
  Spatial data: maps
  Image data
  Video data

Example term-frequency vectors (one row per document):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0     2      0       2
Document 2    0     7      0     2     1      0     0     3      0       0
Document 3    0     1      0     0     1      2     2     0      3       0

Example transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples, examples, instances, data points, objects, or tuples.
Data objects are described by attributes.
Database rows correspond to data objects; columns correspond to attributes.
Attributes
Attribute (or dimension, feature, variable): a data
field, representing a characteristic or feature of a
data object.
  E.g., customer_ID, name, address
Types:
  Nominal
  Binary
  Ordinal
  Numeric: quantitative
    Interval-scaled
    Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
A nominal scale, as the name implies, is simply some placing of
data into categories, without any order or structure.
Hair_color = {black, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of values as being an order of
magnitude larger than the unit of measurement
(10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts
Discrete vs. Continuous Attributes

Discrete attribute
  Has only a finite or countably infinite set of values
  E.g., zip codes, or the set of words in a collection of documents
  Sometimes represented as integer variables
  Binary attributes are a special case of discrete attributes
Continuous attribute
  Has real numbers as attribute values
  E.g., temperature, height, weight
  Typically represented as floating-point variables
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of
precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
  Sample mean: x̄ = (1/n) Σ_{i=1..n} x_i        Population mean: μ = (Σ x_i) / N
  Note: n is the sample size and N is the population size.
  Weighted arithmetic mean: x̄ = (Σ w_i x_i) / (Σ w_i)
  Trimmed mean: the mean obtained after chopping off extreme values
Median: middle value of the sorted data (or average of the two middle values)
  Estimated by interpolation for grouped data:
    median = L1 + ((n/2 - (Σ freq)_l) / freq_median) × width
Mode
  Value that occurs most frequently in the data
  Unimodal, bimodal, trimodal
  Empirical formula: mean - mode ≈ 3 × (mean - median)
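The measures above can be sketched with Python's statistics module; the salary-style data values below are invented for illustration.

```python
import statistics

# Illustrative data (e.g., salaries in $1000s); values are made up for this sketch.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)            # (1/n) * sum(x_i)
median = statistics.median(data)        # average of the two middle values here
modes = statistics.multimode(data)      # most frequent values; bimodal here

# Trimmed mean: chop off the k smallest and k largest values, then average.
def trimmed_mean(values, k):
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

print(mean, median, modes, trimmed_mean(data, 1))
```

Note that this data set is bimodal (52 and 70 both occur twice), which is why `multimode` rather than `mode` is used.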
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively
and negatively skewed data
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quantiles: points taken at regular intervals of a data distribution,
dividing it into essentially equal-size consecutive sets.
Quartiles: The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-fourth of the
data distribution. They are more commonly referred to as quartiles: Q1
(25th percentile), Q2 (the median), and Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, median, Q3, max
Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
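A small sketch of the five-number summary, the IQR, and the 1.5 × IQR outlier rule; the data values are invented for illustration.

```python
import statistics

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 120]

# Quartiles via statistics.quantiles; "inclusive" treats the data as the
# whole population rather than a sample.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(data), q1, q2, q3, max(data))

# Outlier rule: more than 1.5 * IQR below Q1 or above Q3.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(five_number_summary, iqr, outliers)
```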
Boxplot Analysis
Boxplot for the unit price data for items sold at four
branches of AllElectronics during a given time
period.
Measuring the Dispersion of Data
Variance and standard deviation (sample: s, population: σ) are measures of
data dispersion: they indicate how spread out a data distribution is.
A low standard deviation means that the data observations tend to be very close to the
mean, while a high standard deviation indicates that the data are spread out over a large
range of values.
Variance (algebraic, scalable computation):
  Population:  σ² = (1/N) Σ_{i=1..N} (x_i - μ)²  =  (1/N) Σ x_i² - μ²
  Sample:      s² = (1/(n-1)) Σ_{i=1..n} (x_i - x̄)²  =  (1/(n-1)) [ Σ x_i² - (1/n)(Σ x_i)² ]
Standard deviation s (or σ) is the square root of variance s² (or σ²)
The basic properties of the standard deviation, σ, are as follows:
σ measures spread about the mean and should be considered only when the mean is
chosen as the measure of center.
σ = 0 only when there is no spread, that is, when all observations have the same value.
Otherwise, σ > 0.
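The population and sample formulas above, via the statistics module; the four data values are illustrative.

```python
import statistics

data = [4, 8, 6, 2]
mu = statistics.mean(data)               # 5

pop_var = statistics.pvariance(data, mu) # sigma^2 = (1/N) * sum((x_i - mu)^2)
samp_var = statistics.variance(data)     # s^2 = (1/(n-1)) * sum((x_i - xbar)^2)
pop_std = statistics.pstdev(data)        # sigma = sqrt(sigma^2)
print(pop_var, samp_var, pop_std)
```

The only difference between `pvariance` and `variance` is the denominator (N versus n - 1), matching the two formulas above.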
Histograms
“Histos” means pole or mast, and “gram” means
chart, so a histogram is a chart of poles.
If X is nominal, such as automobile model or item type,
then a pole or vertical bar is drawn for each known
value of X. The height of the bar indicates the
frequency (i.e., count) of that X value. The resulting
graph is more commonly known as a bar chart.
If X is numeric, the term histogram is preferred. The
range of values for X is partitioned into disjoint
consecutive subranges. The subranges, referred to as
buckets or bins, are disjoint subsets of the data
distribution for X. The range of a bucket is known as
the width. Typically, the buckets are of equal width.
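Equal-width bucketing, as described above, takes only a few lines; the data and the choice of three buckets are illustrative.

```python
from collections import Counter

data = [1, 3, 7, 8, 12, 14, 15, 18, 22, 25, 28, 29]
num_buckets = 3
lo, hi = min(data), max(data)
width = (hi - lo) / num_buckets          # equal-width buckets

def bucket_of(x):
    # Map x to its bucket index, clamping the maximum into the last bucket.
    return min(int((x - lo) / width), num_buckets - 1)

counts = Counter(bucket_of(x) for x in data)
for b in range(num_buckets):
    left = lo + b * width
    print(f"[{left:.2f}, {left + width:.2f}): {counts[b]}")
```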
Scatter Plots
A scatter plot is one of the most effective
graphical methods for determining if there
appears to be a relationship, pattern, or trend
between two numeric attributes.
Provides a first look at bivariate data to see
clusters of points, outliers, etc.
Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Two attributes, X and Y, are correlated if one
attribute implies the other. Correlations can be
positive, negative, or null (uncorrelated).
Three cases where there is no observed correlation
between the two plotted attributes in each of the
data sets.
Similarity and Dissimilarity
Similarity
  Numerical measure of how alike two data objects are
  Often falls in the range [0, 1]
  Higher when objects are more alike
Dissimilarity (e.g., distance)
  Numerical measure of how different two data objects are
  Lower when objects are more alike
Data Matrix
n data points with p attributes (rows = objects, columns = attributes):

    | x_11  ...  x_1f  ...  x_1p |
    | ...   ...  ...   ...  ...  |
    | x_i1  ...  x_if  ...  x_ip |
    | ...   ...  ...   ...  ...  |
    | x_n1  ...  x_nf  ...  x_np |
Dissimilarity Matrix
An n-by-n table storing d(i, j), the dissimilarity between objects i and j (lower triangle):

    |   0                              |
    | d(2,1)    0                      |
    | d(3,1)  d(3,2)    0              |
    |   :       :       :              |
    | d(n,1)  d(n,2)   ...   ...   0   |
Data and Dissimilarity Matrix
Measures of similarity can often be expressed as
a function of measures of dissimilarity.
For example, sim(i, j) = 1 - d(i, j), where sim(i, j) is the
similarity between objects i and j.
A data matrix is made up of two entities or
“things,” namely rows (for objects) and columns
(for attributes). Therefore, the data matrix is
often called a two-mode matrix.
The dissimilarity matrix contains one kind of
entity (dissimilarities) and so is called a one-
mode matrix.
Proximity Measures for Nominal Attributes
Method 1: simple matching
  m: number of attributes on which i and j match; p: total number of attributes

    d(i, j) = (p - m) / p

Example: suppose that we have sample data where test-1 is nominal.
Since we have one nominal attribute, test-1, we set p = 1, so that
d(i, j) evaluates to 0 if objects i and j match and to 1 if they differ,
and compute the dissimilarity matrix accordingly.
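A sketch of the simple-matching dissimilarity d(i, j) = (p - m)/p; the single nominal attribute and its values (codes A, B, C) are illustrative stand-ins for test-1.

```python
def nominal_dissimilarity(obj_i, obj_j):
    p = len(obj_i)                                 # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matching attributes
    return (p - m) / p

# Four objects described by one nominal attribute (p = 1):
objects = [("A",), ("B",), ("C",), ("A",)]

# Lower triangle of the dissimilarity matrix:
matrix = [[nominal_dissimilarity(objects[i], objects[j]) for j in range(i + 1)]
          for i in range(len(objects))]
print(matrix)
```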
Proximity Measures for Nominal Attributes
Method 2: use a large number of binary variables
Creating a new asymmetric binary variable for
each of the nominal states
For an object with a given state value, the binary
variable representing that state is set to 1, while
the remaining binary variables are set to 0.
For example, to encode the nominal attribute map_color with five
states, a binary attribute can be created for each of the five colors.
For an object having the color yellow, the yellow attribute is set to 1,
while the remaining four attributes are set to 0.
Proximity Measure for Binary Attributes
A binary variable has only two states: 0 or 1, where 0
means that the variable is absent, and 1 means that it is
present.
Given the variable smoker describing a patient,
1 indicates that the patient smokes
0 indicates that the patient does not.
Treating binary variables as if they are interval-scaled can
lead to misleading clustering results.
Therefore, methods specific to binary data are necessary
for computing dissimilarities.
If all binary variables are thought of as having the same
weight, we have the 2-by-2 contingency table
Proximity Measure for Binary Attributes
              Object j
                1     0    sum
Object i  1     q     r    q+r
          0     s     t    s+t
        sum    q+s   r+t    p

where
  q is the number of attributes that equal 1 for both objects i and j,
  r is the number of attributes that equal 1 for object i but 0 for object j,
  s is the number of attributes that equal 0 for object i but 1 for object j, and
  t is the number of attributes that equal 0 for both objects i and j.
  p is the total number of attributes: p = q + r + s + t.
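Counting q, r, s, and t from two binary vectors, matching the contingency table above; the example vectors are illustrative.

```python
def binary_counts(obj_i, obj_j):
    q = sum(a == 1 and b == 1 for a, b in zip(obj_i, obj_j))  # both 1
    r = sum(a == 1 and b == 0 for a, b in zip(obj_i, obj_j))  # i only
    s = sum(a == 0 and b == 1 for a, b in zip(obj_i, obj_j))  # j only
    t = sum(a == 0 and b == 0 for a, b in zip(obj_i, obj_j))  # both 0
    return q, r, s, t

obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
q, r, s, t = binary_counts(obj_i, obj_j)
print(q, r, s, t)   # p = q + r + s + t equals the number of attributes
```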
Proximity Measure for Binary Attributes
A binary variable is symmetric if both of its states are
equally valuable and carry the same weight
Example: the attribute gender having the states male and
female.
Dissimilarity that is based on symmetric binary
variables is called symmetric binary dissimilarity.
The dissimilarity (distance measure) between objects i and j:

    d(i, j) = (r + s) / (q + r + s + t)
Proximity Measure for Binary Attributes
A binary variable is asymmetric if the outcomes of the
states are not equally important,
Example: the positive and negative outcomes of a HIV test.
we shall code the most important outcome, which is usually the
rarest one, by 1 (HIV positive)
Given two asymmetric binary variables, the agreement of
two 1s (a positive match) is then considered more
significant than that of two 0s (a negative match).
Therefore, such binary attributes are often considered
"unary" (as if having one state).
The dissimilarity based on such attributes is called
asymmetric binary dissimilarity:

    d(i, j) = (r + s) / (q + r + s)
Proximity Measure for Binary Attributes
The asymmetric binary similarity between the
objects i and j, or sim(i, j), can be computed as

    sim(i, j) = q / (q + r + s) = 1 - d(i, j)

This similarity is also known as the Jaccard coefficient.
Proximity Measure for Binary Attributes
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
Proximity Measure for Binary Attributes
Let the values Y and P be 1, and the value N be 0.
Suppose that the distance between objects
(patients) is computed based only on the
asymmetric attributes.
The distance between each pair of the three
patients, Jack, Mary, and Jim, is:

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
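The three distances above can be reproduced directly from the q, r, s counts; Y and P are coded as 1, N as 0, and gender (a symmetric attribute) is left out.

```python
def asymmetric_binary_distance(obj_i, obj_j):
    q = sum(a == 1 and b == 1 for a, b in zip(obj_i, obj_j))
    r = sum(a == 1 and b == 0 for a, b in zip(obj_i, obj_j))
    s = sum(a == 0 and b == 1 for a, b in zip(obj_i, obj_j))
    return (r + s) / (q + r + s)       # negative matches (t) are ignored

#        fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```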
Proximity Measure for Numeric Attributes
Manhattan (city block) distance:

    d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

Euclidean distance:

    d(i, j) = sqrt( |x_i1 - x_j1|² + |x_i2 - x_j2|² + ... + |x_ip - x_jp|² )
Proximity Measure for Numeric Attributes
Data Matrix

point  attribute1  attribute2
x1         1           2
x2         3           5
x3         2           0
x4         4           5

Dissimilarity Matrix (with Euclidean distance)

        x1     x2     x3    x4
x1      0
x2     3.61    0
x3     2.24   5.10    0
x4     4.24   1.00   5.39   0
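The dissimilarity matrix above can be recomputed directly from the four points:

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

# Pairwise Euclidean distances, rounded to two decimals.
dist = {(a, b): round(euclidean(pa, pb), 2)
        for a, pa in points.items() for b, pb in points.items()}
print(dist[("x2", "x1")], dist[("x3", "x1")], dist[("x3", "x2")],
      dist[("x4", "x1")], dist[("x4", "x2")], dist[("x4", "x3")])
```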
Proximity Measure for Numeric Attributes
Minkowski distance: a generalization of
the Euclidean and Manhattan distances. It is
defined as:

    d(i, j) = ( |x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ip - x_jp|^h )^(1/h)

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L_h norm).
h is a positive integer.
It reduces to the Manhattan distance when h = 1 and
the Euclidean distance when h = 2.
Proximity Measure for Numeric Attributes
h = 1: Manhattan (city block, L1 norm) distance
  E.g., the Hamming distance: the number of bits that
  differ between two binary vectors

    d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

h = 2: Euclidean (L2 norm) distance

    d(i, j) = sqrt( |x_i1 - x_j1|² + |x_i2 - x_j2|² + ... + |x_ip - x_jp|² )

h → ∞: "supremum" (L_max norm, L_∞ norm) distance
  This is the maximum difference between any component
  (attribute) of the vectors:

    d(i, j) = max_f |x_if - x_jf|
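The whole Minkowski family fits in one function, with the supremum handled as a limit case:

```python
import math

def minkowski(obj_i, obj_j, h):
    # h = 1: Manhattan; h = 2: Euclidean; h = math.inf: supremum distance.
    if h == math.inf:
        return max(abs(a - b) for a, b in zip(obj_i, obj_j))
    return sum(abs(a - b) ** h for a, b in zip(obj_i, obj_j)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))         # 5.0 (Manhattan)
print(minkowski(x1, x2, 2))         # ~3.61 (Euclidean)
print(minkowski(x1, x2, math.inf))  # 3 (supremum)
```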
Proximity Measure for Numeric Attributes
Example: Let x1 = (1, 2) and x2 = (3, 5) represent
two objects
Proximity Measures for Ordinal Attributes
A discrete ordinal variable resembles a categorical
variable, except that the M states of the ordinal value are
ordered in a meaningful sequence.
Example: professional ranks are often enumerated in a
sequential order, such as assistant, associate, and full for
professors.
Ordinal variables may also be obtained from the
discretization of interval-scaled quantities by splitting the
value range into a finite number of classes.
The values of an ordinal variable can be mapped to
ranks.
Example: suppose that an ordinal variable f has Mf states.
These ordered states define the ranking 1, … , Mf .
Proximity Measures for Ordinal Attributes
Suppose that f is a variable from a set of ordinal
variables describing n objects.
The dissimilarity computation with respect to f
involves the following steps:
Step 1:
The value of f for the ith object is xif , and f has Mf
ordered states, representing the ranking 1, … , Mf.
Replace each x_if by its corresponding rank, r_if ∈ {1, ..., M_f}.
Proximity Measures for Ordinal Attributes
Step 2:
Since each ordinal variable can have a different
number of states, it is often necessary to map the
range of each variable onto [0.0, 1.0] so that each
variable has equal weight.
This can be achieved by replacing the rank r_if of the ith object in
the fth variable by

    z_if = (r_if - 1) / (M_f - 1)

Step 3:
Dissimilarity can then be computed using any of the
distance measures described for interval-scaled
variables.
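The three steps above, sketched for an ordinal attribute with the states fair < good < excellent; the states and sample values are illustrative.

```python
states = ["fair", "good", "excellent"]            # ordered states, M_f = 3
rank = {s: i + 1 for i, s in enumerate(states)}   # Step 1: ranks 1 .. M_f
M_f = len(states)

def z(value):
    # Step 2: normalize the rank onto [0.0, 1.0].
    return (rank[value] - 1) / (M_f - 1)

values = ["excellent", "fair", "good", "excellent"]
normalized = [z(v) for v in values]
print(normalized)                                 # [1.0, 0.0, 0.5, 1.0]

# Step 3: any numeric distance on the z values, e.g. Manhattan:
d_12 = abs(normalized[0] - normalized[1])         # 1.0
```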
Proximity Measures for Ordinal Attributes
Example: Suppose that we have the sample
data:
Proximity Measures for Ordinal Attributes
Step 1: if we replace each value for test-2 by its
rank, the four objects are assigned the ranks 3,
1, 2, and 3, respectively.
Step 2: normalizes the ranking by mapping rank
1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
Step 3: we can use, say, the Euclidean distance,
which results in the following dissimilarity matrix:
Proximity Measures for Mixed Type Attributes
A database may contain all attribute types
Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
One may use a weighted formula to combine their effects
    d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f)  /  Σ_{f=1..p} δ_ij^(f)

where the indicator δ_ij^(f) = 0 if x_if or x_jf is missing, and 1 otherwise.
f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
f is numeric: use the normalized distance
f is ordinal:
  Compute ranks r_if and z_if = (r_if - 1) / (M_f - 1)
  Treat z_if as interval-scaled
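A sketch of the weighted mixed-type formula: each attribute f contributes a dissimilarity and an indicator δ (0 when a value is missing). The attribute types and object values below are illustrative.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds):
    # Sum of delta * d over attributes, divided by the sum of delta.
    num = den = 0.0
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:
            continue                   # delta_ij_f = 0: skip missing values
        if kind == "nominal":
            d = 0.0 if a == b else 1.0
        else:                          # numeric, or ordinal mapped to z
            d = abs(a - b)             # assumes values normalized to [0, 1]
        num += d
        den += 1.0
    return num / den

kinds = ["nominal", "numeric", "ordinal"]
obj_i = ["A", 0.45, 1.0]               # ordinal value already mapped to z
obj_j = ["A", 0.40, 0.5]
print(round(mixed_dissimilarity(obj_i, obj_j, kinds), 3))
```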
Proximity Measures for Mixed Type Attributes
For test-1 (which is nominal), the dissimilarity is computed as
outlined above.
For test-2 (which is ordinal), the dissimilarity is computed as
outlined above.
We can now calculate the dissimilarity matrices
for the two variables.
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Cosine Similarity
cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||),
where · indicates the vector dot product and ||d|| is the length (norm) of vector d.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.48
||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 = 4.12
cos(d1, d2) = 25 / (6.48 × 4.12) = 0.94
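The computation above, step by step:

```python
import math

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

dot = sum(a * b for a, b in zip(d1, d2))       # 25
norm1 = math.sqrt(sum(a * a for a in d1))      # sqrt(42) ~ 6.48
norm2 = math.sqrt(sum(b * b for b in d2))      # sqrt(17) ~ 4.12
cos = dot / (norm1 * norm2)
print(round(cos, 2))                           # 0.94
```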
Thanks