
Getting to Know Your Data (Understanding Data)

Dr. Abdul Majid


adlmjd@yahoo.com

Getting to know about your data
 It’s tempting to jump straight into mining, but first, we need
to get the data ready.
 Knowledge about your data is useful for data preprocessing.
 What are the types of attributes or fields that make up your
data?
 What kind of values does each attribute have? Which
attributes are discrete, and which are continuous-valued?
 What do the data look like? How are the values distributed?
 Are there ways we can visualize the data to get a better
sense of it all? Can we spot any outliers?
 Can we measure the similarity of some data objects with
respect to others?
 Gaining such insight into the data will help with the subsequent
analysis.

Getting to know about your data

 Data Objects and Attribute Types


 Basic Statistical Descriptions of Data
 Measuring Data Similarity and Dissimilarity
 Summary

Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents represented as term-frequency vectors, e.g.:

                  team  coach  play  ball  score  game  win  lost  timeout  season
    Document 1      3     0     5     0      2     6     0     2       0       2
    Document 2      0     7     0     2      1     0     0     3       0       0
    Document 3      0     1     0     0      1     2     2     0       3       0

   Transaction data, e.g.:

    TID   Items
    1     Bread, Coke, Milk
    2     Beer, Bread
    3     Beer, Coke, Diaper, Milk
    4     Beer, Bread, Diaper, Milk
    5     Coke, Diaper, Milk

 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time-series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, or tuples.
 Data objects are described by attributes.
 Database rows → data objects; columns → attributes

Attributes
 Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object.
   E.g., customer_ID, name, address
 Types:
   Nominal
   Binary
   Ordinal
   Numeric: quantitative
     Interval-scaled
     Ratio-scaled
Attribute Types
 Nominal: categories, states, or “names of things”
 A nominal scale, as the name implies, is simply some placing of
data into categories, without any order or structure.
 Hair_color = {black, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
   Measured on a scale of equal-sized units
   Values have order
   E.g., temperature in C˚ or F˚, calendar dates
   No true zero-point
 Ratio
   Inherent zero-point
   We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
   E.g., temperature in Kelvin, length, counts

Discrete vs. Continuous Attribute Types
 Discrete Attribute
 Has only a finite set of values

 E.g., zip codes, profession, or the set of words in a

collection of documents
 Sometimes, represented as integer variables

 Note: Binary attributes are a special case of discrete

attributes
 Continuous Attribute
 Has real numbers as attribute values

 E.g., temperature, height, or weight

 Practically, real values can only be measured and

represented using a finite number of digits


 Continuous attributes are typically represented as

floating-point variables

Getting to Know Your Data

 Data Objects and Attribute Types


 Basic Statistical Descriptions of Data
 Measuring Data Similarity and Dissimilarity
 Summary

Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency,
variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of
precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample),  $\mu = \frac{\sum x}{N}$ (population)
  Note: n is sample size and N is population size.
 Weighted arithmetic mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Trimmed mean: chopping extreme values
 Median:
   Middle value if odd number of values, or average of the middle two values otherwise
   Estimated by interpolation (for grouped data):
    $median = L_1 + \left( \frac{n/2 - (\sum freq)_l}{freq_{median}} \right) \times width$
 Mode
   Value that occurs most frequently in the data
   Unimodal, bimodal, trimodal
   Empirical formula: $mean - mode \approx 3 \times (mean - median)$
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively
and negatively skewed data

Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quantiles: Quantiles are points taken at regular intervals of a data
distribution, dividing it into essentially equal size consecutive sets.
 Quartiles: The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles. Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
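A small NumPy sketch of the five-number summary and the 1.5 × IQR outlier rule (sample values are made up):

```python
# Sketch: five-number summary and IQR-based outlier flagging.
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
print(x.min(), q1, med, q3, x.max())   # min, Q1, median, Q3, max

# Flag values more than 1.5 * IQR beyond the quartiles as potential outliers
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(x[(x < lo) | (x > hi)])          # here: [110]
```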

Measuring the Dispersion of Data

A plot of the data distribution for some attribute X. The


quantiles plotted are quartiles. The three quartiles divide the
distribution into four equal-size consecutive subsets. The
second quartile corresponds to the median.

Boxplot Analysis

 Five-number summary of a distribution


 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended
to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually

Boxplot Analysis

Boxplot for the unit price data for items sold at four
branches of AllElectronics during a given time
period.

Boxplot Analysis

 Visualization of Data Dispersion: 3-D Boxplots

Measuring the Dispersion of Data
 Variance and standard deviation (sample: s, population: σ) : A measure of
spread.
 Variance and standard deviation are measures of data dispersion.
 They indicate how spread out a data distribution is.
 A low standard deviation means that the data observations tend to be very close to the
mean, while a high standard deviation indicates that the data are spread out over a large
range of values.

 Variance: (algebraic, scalable computation)
  Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
 Standard deviation s (or σ) is the square root of variance s2 (or σ2)
 The basic properties of the standard deviation, σ, are as follows:
 σ measures spread about the mean and should be considered only when the mean is
chosen as the measure of center.
 σ =0 only when there is no spread, that is, when all observations have the same value.
Otherwise, σ > 0.
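A short sketch contrasting the population and sample formulas (NumPy's ddof argument selects the divisor):

```python
# Sketch: population (divide by N) vs. sample (divide by n - 1) variance.
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

print(np.var(x, ddof=0), np.std(x, ddof=0))   # population sigma^2, sigma
print(np.var(x, ddof=1), np.std(x, ddof=1))   # sample s^2, s
```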

Histograms
 “Histos” means pole or mast, and “gram” means
chart, so a histogram is a chart of poles.
 If X is nominal, such as automobile model or item type,
then a pole or vertical bar is drawn for each known
value of X. The height of the bar indicates the
frequency (i.e., count) of that X value. The resulting
graph is more commonly known as a bar chart
 If X is numeric, the term histogram is preferred. The
range of values for X is partitioned into disjoint
consecutive subranges. The subranges, referred to as
buckets or bins, are disjoint subsets of the data
distribution for X. The range of a bucket is known as
the width. Typically, the buckets are of equal width.
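A matplotlib sketch of both cases, with made-up data (bar chart for a nominal attribute, histogram for a numeric one):

```python
# Sketch: bar chart (nominal X) vs. histogram (numeric X).
import matplotlib.pyplot as plt

# Nominal X: one vertical bar per known value; height = frequency
colors = ["red", "blue", "red", "green", "blue", "red"]
labels = sorted(set(colors))
plt.bar(labels, [colors.count(c) for c in labels])
plt.title("Bar chart (nominal attribute)")
plt.show()

# Numeric X: range partitioned into disjoint, equal-width buckets (bins)
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
plt.hist(values, bins=5)
plt.title("Histogram (numeric attribute)")
plt.show()
```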

[Figure: example histogram of a numeric attribute]
Scatter Plots
 A scatter plot is one of the most effective
graphical methods for determining if there
appears to be a relationship, pattern, or trend
between two numeric attributes.
 Provides a first look at bivariate data to see
clusters of points, outliers, etc.
 Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
 Two attributes, X, and Y, are correlated if one
attribute implies the other. Correlations can be
positive, negative, or null (uncorrelated).
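A minimal scatter-plot sketch with made-up, positively correlated data:

```python
# Sketch: scatter plot of two numeric attributes.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]   # roughly y = 2x

plt.scatter(x, y)   # each (x, y) pair plotted as a point in the plane
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
```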

[Figure: example scatter plot]
Scatter Plots

Scatter plots can be used to find (a) positive or (b)


negative correlations between attributes.

Three cases where there is no observed correlation
between the two plotted attributes in each of the
data sets.

Getting to Know Your Data

 Data Objects and Attribute Types


 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary

Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are

 Value is higher when objects are more alike

 Often falls in the range [0,1]

 Dissimilarity (e.g., distance)
   Numerical measure of how different two data objects are
   Lower when objects are more alike
   Minimum dissimilarity is often 0
   Upper limit varies
 Proximity refers to a similarity or dissimilarity

Data Matrix
$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

 This represents n objects, such as persons, with


p variables (measurements or attributes), such
as age, height, weight, gender, and so on.
  The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables)

Dissimilarity Matrix
 0 
 d(2,1) 0 
 
 d(3,1) d ( 3,2) 0 
 
 : : : 
d ( n,1) d ( n,2) ... ... 0

 It is often represented by an n-by-n matrix, where d(i, j) is the measured difference or dissimilarity between objects i and j.
  In general, d(i, j) is a nonnegative number that is
 close to 0 when objects i and j are highly similar or “near” each
other
 becomes larger the more they differ
  Where d(i, j) = d(j, i), and d(i, i) = 0

Data and Dissimilarity Matrix
 Measures of similarity can often be expressed as a function of measures of dissimilarity.
   For example, sim(i, j) = 1 − d(i, j), where sim(i, j) is the similarity between objects i and j.
 A data matrix is made up of two entities or
“things,” namely rows (for objects) and columns
(for attributes). Therefore, the data matrix is
often called a two-mode matrix.
 The dissimilarity matrix contains one kind of
entity (dissimilarities) and so is called a one-
mode matrix.

Proximity Measures for Nominal Attributes

 Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
 Method 1: Simple matching
   m: # of matches, p: total # of variables
    $d(i,j) = \frac{p - m}{p}$

 Method 2: Use a large number of binary


attributes
 creating a new binary attribute for each of the
M nominal states

Proximity Measures for Nominal Attributes

 Method 1: Simple matching


 The dissimilarity between two objects i and j can be
computed based on the ratio of mismatches:
    $d(i,j) = \frac{p - m}{p}$
 m is the number of matches (i.e., the number of
variables for which i and j are in the same state)
  p is the total number of variables.
 Weights can be assigned to increase the effect
of m or to assign greater weight to the matches
in variables having a larger number of states.
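A small sketch of simple matching, assuming objects are given as equal-length lists of nominal values (unweighted):

```python
# Sketch: simple matching dissimilarity d(i, j) = (p - m) / p.
def nominal_dissim(obj_i, obj_j):
    p = len(obj_i)                                 # total number of variables
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matches
    return (p - m) / p

# Two objects described by three nominal attributes (illustrative values)
print(nominal_dissim(["code-A", "red", "small"],
                     ["code-A", "blue", "small"]))   # (3 - 2) / 3 = 0.33
```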

Proximity Measures for Nominal Attributes
 Suppose that we have the sample data, where test-1 is categorical.
 Let's compute the dissimilarity matrix using
  $d(i,j) = \frac{p - m}{p}$
  so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ.
 Since here we have one categorical variable, test-1, we set p = 1.
Proximity Measures for Nominal Attributes
 Method 2: use a large number of binary variables
 Creating a new asymmetric binary variable for
each of the nominal states
 For an object with a given state value, the binary
variable representing that state is set to 1, while
the remaining binary variables are set to 0.
 For example, to encode the categorical variable map_color, a binary variable can be created for each of its five colors.
 For an object having the color yellow, the yellow variable
is set to 1, while the remaining four variables are set to
0.
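A sketch of this encoding, assuming five example color states (the exact color list is illustrative):

```python
# Sketch: one binary variable per nominal state (one-hot encoding).
states = ["red", "yellow", "green", "pink", "blue"]   # assumed M = 5 states

def one_hot(value):
    # 1 for the variable representing the object's state, 0 for the rest
    return [1 if s == value else 0 for s in states]

print(one_hot("yellow"))   # [0, 1, 0, 0, 0]
```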

Proximity Measure for Binary Attributes
 A binary variable has only two states: 0 or 1, where 0
means that the variable is absent, and 1 means that it is
present.
 Given the variable smoker describing a patient,
 1 indicates that the patient smokes
 0 indicates that the patient does not.
 Treating binary variables as if they are interval-scaled can
lead to misleading clustering results.
 Therefore, methods specific to binary data are necessary
for computing dissimilarities.
 If all binary variables are thought of as having the same
weight, we have the 2-by-2 contingency table

Proximity Measure for Binary Attributes
                   Object j
                    1       0     | sum
  Object i    1     q       r     | q + r
              0     s       t     | s + t
           sum    q + s   r + t   |   p
 where
 q is the number of variables that equal 1 for both objects i and j,
 r is the number of variables that equal 1 for object i but that are 0 for
object j,
 s is the number of variables that equal 0 for object i but equal 1 for
object j, and
 t is the number of variables that equal 0 for both objects i and j.
 p is the total number of variables, p = q+r+s+t.

Proximity Measure for Binary Attributes
 A binary variable is symmetric if both of its states are
equally valuable and carry the same weight
 Example: the attribute gender having the states male and
female.
 Dissimilarity that is based on symmetric binary
variables is called symmetric binary dissimilarity.
 The dissimilarity (distance measure) between objects i and j:
  $d(i,j) = \frac{r + s}{q + r + s + t}$
Proximity Measure for Binary Attributes
 A binary variable is asymmetric if the outcomes of the
states are not equally important,
 Example: the positive and negative outcomes of a HIV test.
 we shall code the most important outcome, which is usually the
rarest one, by 1 (HIV positive)
 Given two asymmetric binary variables, the agreement of
two 1s (a positive match) is then considered more
significant than that of two 0s (a negative match).
 Therefore, such binary variables are often considered
“monary” (as if having one state).
  The dissimilarity based on such variables is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and is ignored:
  $d(i,j) = \frac{r + s}{q + r + s}$

Proximity Measure for Binary Attributes
 The asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as
  $sim(i,j) = \frac{q}{q + r + s} = 1 - d(i,j)$
 The coefficient sim(i, j) is called the Jaccard coefficient
 When both symmetric and asymmetric binary
variables occur in the same data set, the mixed
variables approach can be applied (described
later)
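A sketch of these measures from the q, r, s, t counts, using the Jack/Mary vectors from the example that follows (Y/P coded as 1, N as 0):

```python
# Sketch: symmetric/asymmetric binary dissimilarity and Jaccard similarity.
def counts(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))   # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))   # 0-0 matches
    return q, r, s, t

def symmetric_dissim(i, j):
    q, r, s, t = counts(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_dissim(i, j):       # negative matches t are ignored
    q, r, s, t = counts(i, j)
    return (r + s) / (q + r + s)

def jaccard_sim(i, j):             # sim = 1 - asymmetric dissimilarity
    q, r, s, t = counts(i, j)
    return q / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]   # fever, cough, test-1..test-4
mary = [1, 0, 1, 0, 1, 0]
print(asymmetric_dissim(jack, mary))   # 1 / 3 = 0.33
print(jaccard_sim(jack, mary))         # 2 / 3 = 0.67
```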

Proximity Measure for Binary Attributes
 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

 name: an object identifier


  gender: a symmetric attribute
 fever, cough, test-1, test-2, test-3, test-4: the
asymmetric attributes

Proximity Measure for Binary Attributes
 Let the values Y and P be 1, and the value N = 0
 Suppose that the distance between objects
(patients) is computed based only on the
asymmetric variables.
 The distance between each pair of the three
patients, Jack, Mary, and Jim, is
0 1
d ( jack , mary )   0.33
2  0 1
11
d ( jack , jim )   0.67
111
1 2
d ( jim, mary )   0.75
11 2

Proximity Measure for Binary Attributes

0 1
d ( jack , mary )   0.33
2  0 1
11
d ( jack , jim )   0.67
111
1 2
d ( jim, mary )   0.75
11 2

 These measurements suggest that


 Mary and Jim are unlikely to have a similar disease
because they have the highest dissimilarity value
among the three pairs.
 Of the three patients, Jack and Mary are the most
likely to have a similar disease.

Proximity Measure for Numeric Attributes
 Manhattan (city block) distance:
  $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
 Euclidean distance:
  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
 Properties of Euclidean and Manhattan distances:
   d(i, j) ≥ 0: distance is a nonnegative number.
   d(i, i) = 0: the distance of an object to itself is 0.
   d(i, j) = d(j, i): distance is a symmetric function.
   d(i, j) ≤ d(i, k) + d(k, j): triangle inequality.

Proximity Measure for Numeric Attributes
Data Matrix

  point   attribute1   attribute2
  x1          1            2
  x2          3            5
  x3          2            0
  x4          4            5

[Figure: the four points plotted in the plane]

Dissimilarity Matrix (with Euclidean distance)

        x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0
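A SciPy sketch reproducing the matrix above for the four points:

```python
# Sketch: pairwise Euclidean dissimilarity matrix via SciPy.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])   # x1..x4

print(np.round(squareform(pdist(X, metric="euclidean")), 2))
# last row: [4.24, 1.0, 5.39, 0.0], matching x4 above
```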

Proximity Measure for Numeric Attributes
 Minkowski Distance: a generalization of the Euclidean and Manhattan distances, defined as:
  $d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
   where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
   h is a positive integer
   It represents the Manhattan distance when h = 1 and the Euclidean distance when h = 2

Proximity Measure for Numeric Attributes
 h = 1: Manhattan (city block, L1 norm) distance
   E.g., the Hamming distance: the number of bits that are different between two binary vectors
  $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
 h = 2: Euclidean (L2 norm) distance
  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
 h → ∞: “supremum” (Lmax norm, L∞ norm) distance
   This is the maximum difference between any component (attribute) of the vectors:
  $d(i,j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$

Proximity Measure for Numeric Attributes
 Example: Let x1 = (1, 2) and x2 = (3, 5) represent two objects. Then:
   Manhattan: $d(x_1, x_2) = |1-3| + |2-5| = 5$
   Euclidean: $d(x_1, x_2) = \sqrt{2^2 + 3^2} = 3.61$
   Supremum: $d(x_1, x_2) = \max(2, 3) = 3$
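A short sketch computing the three orders for this example:

```python
# Sketch: Minkowski distance for h = 1, 2, and infinity on x1, x2.
import numpy as np

x1, x2 = np.array([1, 2]), np.array([3, 5])
diff = np.abs(x1 - x2)              # [2, 3]

print(diff.sum())                   # h = 1, Manhattan: 5
print(np.sqrt((diff ** 2).sum()))   # h = 2, Euclidean: 3.61
print(diff.max())                   # h -> inf, supremum: 3
```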

Proximity Measures for Ordinal Attributes
 A discrete ordinal variable resembles a categorical
variable, except that the M states of the ordinal value are
ordered in a meaningful sequence.
 Example: professional ranks are often enumerated in a
sequential order, such as assistant, associate, and full for
professors.
 Ordinal variables may also be obtained from the
discretization of interval-scaled quantities by splitting the
value range into a finite number of classes.
 The values of an ordinal variable can be mapped to
ranks.
 Example: suppose that an ordinal variable f has Mf states.
 These ordered states define the ranking 1, … , Mf .

Proximity Measures for Ordinal Attributes
 Suppose that f is a variable from a set of ordinal
variables describing n objects.
 The dissimilarity computation with respect to f
involves the following steps:
 Step 1:
 The value of f for the ith object is xif , and f has Mf
ordered states, representing the ranking 1, … , Mf.
 Replace each xif by its corresponding rank:
  $r_{if} \in \{1, \dots, M_f\}$

Proximity Measures for Ordinal Attributes
 Step 2:
   Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight.
   This can be achieved by replacing the rank $r_{if}$ of the ith object in the fth variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
 Step 3:
   Dissimilarity can then be computed using any of the distance measures described for interval-scaled variables.

Proximity Measures for Ordinal Attributes
 Example: Suppose that we have the sample
data:

 There are three states for test-2, namely fair, good, and excellent, that is, Mf = 3.

Proximity Measures for Ordinal Attributes
 Step 1: if we replace each value for test-2 by its
rank, the four objects are assigned the ranks 3,
1, 2, and 3, respectively.
 Step 2: normalizes the ranking by mapping rank
1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
 Step 3: we can use, say, the Euclidean distance,
which results in the following dissimilarity matrix:
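A sketch of steps 1 and 2 for test-2 (states and values as in the example):

```python
# Sketch: map ordinal values to ranks, then normalize ranks onto [0, 1].
states = ["fair", "good", "excellent"]            # M_f = 3 ordered states
rank = {s: i + 1 for i, s in enumerate(states)}   # fair -> 1, ..., excellent -> 3

test2 = ["excellent", "fair", "good", "excellent"]
M_f = len(states)
z = [(rank[v] - 1) / (M_f - 1) for v in test2]
print(z)   # [1.0, 0.0, 0.5, 1.0] -- the normalized ranks from step 2
```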

Proximity Measures for Mixed Type Attributes
 A database may contain all attribute types
 Nominal, symmetric binary, asymmetric binary, numeric,

ordinal
 One may use a weighted formula to combine their effects:
  $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
 f is binary or nominal:
  $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
 f is numeric: use the normalized distance
 f is ordinal
   Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
   Treat $z_{if}$ as interval-scaled

Proximity Measures for Mixed Type Attributes

 The indicator $\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing (i.e., there is no measurement of variable f for object i or object j), or (2) $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary; otherwise, $\delta_{ij}^{(f)} = 1$.
 Example: the sample data

Proximity Measures for Mixed Type Attributes
 For test-1 (which is categorical) is the same as
outlined above
 For test-2 (which is ordinal) is the same as
outlined above
 We can now calculate the dissimilarity matrices
for the two variables.
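A sketch of the weighted combination for the four objects, assuming illustrative test-1 values (code-A, code-B, code-C, code-A; match → 0, differ → 1), the test-2 distances implied by the normalized ranks 1.0, 0.0, 0.5, 1.0 above, and no missing values (every indicator δ = 1):

```python
# Sketch: mixed-type dissimilarity as a delta-weighted average of
# per-variable dissimilarity matrices.
import numpy as np

d_test1 = np.array([[0, 1, 1, 0],       # nominal: 0 if equal, 1 otherwise
                    [1, 0, 1, 1],       # (assumed values: A, B, C, A)
                    [1, 1, 0, 1],
                    [0, 1, 1, 0]], dtype=float)
d_test2 = np.array([[0.0, 1.0, 0.5, 0.0],   # ordinal: |z_i - z_j|
                    [1.0, 0.0, 0.5, 1.0],
                    [0.5, 0.5, 0.0, 0.5],
                    [0.0, 1.0, 0.5, 0.0]])

delta1 = np.ones_like(d_test1)   # indicator: 1 when both values are present
delta2 = np.ones_like(d_test2)

d = (delta1 * d_test1 + delta2 * d_test2) / (delta1 + delta2)
print(d)   # e.g. d(4, 1) = (0 + 0) / 2 = 0.0: objects 1 and 4 are most alike
```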

Cosine Similarity
 A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.

 Other vector objects: gene features in micro-arrays, …


 Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
  $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
  where · indicates the vector dot product and ‖d‖ is the length (Euclidean norm) of vector d

Cosine Similarity
 $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
 Ex: Find the similarity between documents 1 and 2.

  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

  d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
  ‖d1‖ = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
  ‖d2‖ = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 = 4.123
  cos(d1, d2) = 25 / (6.481 × 4.123) = 0.94
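The same computation as a sketch in Python:

```python
# Sketch: cosine similarity for the two term-frequency vectors above.
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94
```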

Thanks

