
Knowing Your Data and Data Visualization

Source: Chapter 2 of Data Mining by Han and Kamber

CSE25.8: Elective-I
MCA II Sem
March-June 2022

Content
• Data objects and Attributes
• Measures of Central tendency
• Dispersion of data
• Data Visualization
• Data proximity measures

Data objects
• A data object represents an entity:
– In a sales database, the objects may be customers, store items,
and sales;
– in a medical database, the objects may be patients;
– in a university database, the objects may be students,
professors, and courses.

• Data objects are typically described by attributes.


• Data objects can also be referred to as samples,
examples, instances, data points, or objects.

Attributes/Feature vector
• A set of attributes used to describe a given object is
called an attribute vector (or feature vector).
• These terms are used synonymously in data science: attribute,
dimension, feature, predictor, and variable.
• The term dimension is commonly used in data
warehousing.
• Machine learning literature tends to use the term feature,
while statisticians prefer the term variable.
• Data mining and database professionals commonly use
the term attribute.

Data
• The distribution of data involving one
attribute/variable is called univariate.
• A bivariate distribution involves two
attributes.
• A multivariate distribution involves more than two attributes, and so on.

Attribute Types
The type of an attribute is determined by the set of possible values the
attribute can have:
• Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV
positive)
• Ordinal
– Values have a meaningful order (ranking) but magnitude
between successive values is not known.
– Size = {small, medium, large}, grades, army rankings

Nominal attribute
• Nominal/categorical
– e.g., customer _ID, name, address, hair color, marital status, etc.
– possible values for hair color are black, brown, blond, red, auburn,
gray, and white.
– marital status can take values single, married, divorced, and
widowed, etc.

• Even though a nominal attribute may have integers as


values, it is not considered a numeric attribute because the
integers are not meant to be used quantitatively.

Binary Attribute
• A binary attribute is a nominal attribute with only two
categories or states: 0 or 1, where 0 typically means that
the attribute is absent, and 1 means that it is present.
• Binary attributes are referred to as Boolean if the two
states correspond to true and false.

Ordinal attributes
• Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured
objectively; thus ordinal attributes are often used in
surveys for ratings.
– In one survey, participants were asked to rate how satisfied they
were as customers.
– Customer satisfaction had the following ordinal categories: 0:
very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3:
satisfied, and 4: very satisfied.

• Ordinal attributes may also be obtained from the


discretization of numeric quantities by splitting the value
range into a finite number of ordered categories

Numeric Attribute Types


• A numeric attribute is a measurable quantity (integer or real-valued).
• For numeric attributes, measures of central tendency, e.g., mean, median, and mode, can be computed.
• Interval
  – Measured on a scale of equal-sized units
  – Values have order
    • E.g., temperature in °C or °F, calendar dates
  – No true zero-point
• Ratio
  – Inherent zero-point
  – We can speak of values as being multiples (or ratios) of one another (e.g., 10 K is twice as high as 5 K)
    • e.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a
collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and
represented using a finite number of digits
– Continuous attributes are typically represented as
floating-point variables

Measuring central tendency


• Knowing such basic statistics regarding each attribute
makes it easier to fill in missing values, smooth noisy
values, and spot outliers during data preprocessing.
• Knowledge of the attributes and attribute values can also
help in fixing inconsistencies incurred during data
integration.
• Plotting the measures of central tendency shows us if
the data are symmetric or skewed.
• Quantile plots, histograms, and scatter plots are other
graphic displays of basic statistical descriptions.
• These can all be useful during data preprocessing and
can provide insight into areas for mining.

Basic Statistical Descriptions of Data
• Basic statistical descriptions can be used to identify
properties of the data and highlight which data values
should be treated as noise or outliers.
• Measures of central tendency
– given an attribute, where do most of its values fall e.g., mean,
median, mode, and midrange
• Dispersion of the data
– how are the data spread out e.g., range, quartiles, interquartile
range; the five-number summary and boxplots; variance and
standard deviation of the data
• Graphic displays of basic statistical descriptions to
visually inspect our data
– bar charts, pie charts, and line graphs, quantile plots, quantile–
quantile plots, histograms, and scatter plots

Measuring the Central Tendency


• Mean (algebraic measure), sample vs. population:

    \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}

  Note: n is the sample size and N is the population size.
• Weighted arithmetic mean:

    \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}

  – Trimmed mean: the mean computed after chopping off extreme values
• Median:
  – Middle value if there is an odd number of values; average of the middle two values otherwise
  – Estimated by interpolation (for grouped data):

    median = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width

• Mode
  – Value that occurs most frequently in the data
  – Unimodal, bimodal, trimodal
  – Empirical formula for unimodal numeric data that are moderately skewed:

    mean - mode \approx 3 \times (mean - median)
Trimmed Mean
• It is the mean obtained after chopping off values at
the high and low extremes.

• A major problem with the mean is its sensitivity to


extreme (e.g., outlier) values.
• Even a small number of extreme values can corrupt the
mean.
– e.g., the mean salary at a company may be substantially pushed
up by that of a few highly paid managers.
– mean score of a class in an exam could be pulled down quite a
bit by a few very low scores.

Median and Midrange

• For skewed (asymmetric) data, a better measure of the center of data is the median, which is the middle value in a set of ordered data values.
• The midrange is the average of the largest and smallest values in the set.
• This measure is easy to compute using the SQL aggregate functions max() and min().

Interval Median
• The median is expensive to compute when we have a large number of
observations.
• For numeric attributes, however, we can easily approximate the value.
– E.g., employees may be grouped according to their annual salary in
intervals such as $10–20,000, $20–30,000, and so on. Let the interval that
contains the median frequency be the median interval.
• We can approximate the median of the entire data set (e.g., the median
salary) by interpolation using the formula:
    median = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width
• where L1 is the lower boundary of the median interval
• n is the number of values in the entire data set
• (∑freq)l is the sum of the frequencies of all of the intervals that are lower
than the median interval
• freqmedian is the frequency of the median interval
• width is the width of the median interval.
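A minimal sketch of this interpolation; the salary intervals and frequencies below are invented for illustration:

```python
# Grouped-median interpolation, as in the formula above.
def grouped_median(L1, n, sum_freq_lower, freq_median, width):
    return L1 + ((n / 2 - sum_freq_lower) / freq_median) * width

# Intervals: $10-20K (200 people), $20-30K (450), $30-40K (300).
n = 200 + 450 + 300          # 950 values; n/2 = 475 falls in $20-30K,
                             # so that is the median interval
print(grouped_median(L1=20_000, n=n, sum_freq_lower=200,
                     freq_median=450, width=10_000))  # ~26111.11
```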

Symmetric vs. Skewed Data


• Data in most real applications are not symmetric.
• They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median, or negatively skewed, where the mode occurs at a value greater than the median (see figure).

[Figure: positively skewed vs. negatively skewed distributions]

Range and Quartiles
• The range of the set is the difference between the largest
(max()) and smallest (min()) values.
• Quantiles are points taken at regular intervals of a data
distribution, dividing it into essentially equal size
consecutive sets.
– The 2-quantile is the data point dividing the lower and upper
halves of the data distribution; it corresponds to the median.

– The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-
fourth of the data distribution. They are more commonly
referred to as quartiles.

– The 100-quantiles are more commonly referred to as percentiles;


they divide the data distribution into 100 equal-sized consecutive
sets.

Interquartile Range
• The quartiles give an indication of a distribution’s center,
spread, and shape.
– The first quartile, denoted by Q1, is the 25th percentile. It cuts off
the lowest 25% of the data.
– The third quartile, denoted by Q3, is the 75th percentile—it cuts
off the lowest 75% (or highest 25%) of the data.

• The distance between the first and third quartiles is a


simple measure of spread that gives the range covered
by the middle half of the data.
• This distance is called the interquartile range (IQR) and is
defined as: IQR=Q3-Q1.
• Boxplots can be computed in O(n log n) time.

Box Plot
• A boxplot is a standardized way of displaying the distribution of data based on a five-number summary:
  – minimum
  – first quartile (Q1)
  – median
  – third quartile (Q3)
  – maximum
• The box contains the middle 50% of the data points, and each of the two whiskers covers 25% of the data points.
• Outlier: usually, a value more than 1.5 × IQR beyond the quartiles, e.g.,
  – smaller than Q1 - 1.5 × IQR, or
  – bigger than Q3 + 1.5 × IQR

Drawing Box Plot


• Consider this data: 2, 51, 53, 54, 43, 51, 62, 49, 50, 63, 60.
• Arrange the data in ascending order:
  2, 43, 49, 50, 51, 51, 53, 54, 60, 62, 63
• Here the number of observations n = 11.
• Median position = (n+1)/2 = (11+1)/2 = 6, so the median is the 6th value, 51.
• Q1 = median of the first half of the data (2, 43, 49, 50, 51)
  – i.e., position (5+1)/2 = 3, so Q1 is 49.
• Q3 = median of the second half of the data (53, 54, 60, 62, 63)
  – i.e., position (5+1)/2 = 3, so Q3 is 60.
• IQR = Q3 - Q1 = 60 - 49 = 11
• Outliers: smaller than Q1 - 1.5 × IQR (= 32.5) or bigger than Q3 + 1.5 × IQR (= 76.5)
  – e.g., 2 is an outlier as it is smaller than 32.5.
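The same numbers can be checked with numpy; note that numpy's default quartile interpolation differs slightly from the median-of-halves method used above:

```python
# Sketch of the five-number summary and the 1.5 x IQR outlier rule.
# numpy's linear interpolation gives Q1 = 49.5, Q3 = 57 here, but the
# value 2 is still flagged as an outlier.
import numpy as np

data = np.array([2, 43, 49, 50, 51, 51, 53, 54, 60, 62, 63])
q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data.min(), q1, med, q3, data.max())
print("outliers:", data[(data < lo) | (data > hi)])  # [2]
```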

Boxplot Analysis
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended
to Minimum and Maximum
– Outliers: points beyond a specified outlier
threshold, plotted individually

e.g. Box Plot

Variance and Standard deviation
• Variance and standard deviation are measures of
data dispersion.
• SD measures spread about the mean and should be
considered only when the mean is chosen as the
measure of center.
• SD indicates how spread out a data distribution is.
 A low standard deviation means that the data
observations tend to be very close to the mean.
 While a high standard deviation indicates that the data
are spread out over a large range of values.

Measuring dispersion of Data


• Standard deviation and variance have the following properties:
  – proportional to the scatter of the data, e.g., small when the data are clustered together and large when the data are widely scattered
  – independent of the number of values in the data
  – independent of the mean of the data

• Variance = (S.D.)²

    s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]

• When the data are only a sample of the whole population, we divide by n - 1 rather than n, as above.
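A small sketch verifying that the two algebraic forms of the sample variance agree (data values invented):

```python
# Both forms of s^2 from the formula above should give the same result.
import math

x = [4.0, 7.0, 9.0, 10.0, 15.0]
n = len(x)
mean = sum(x) / n

s2_a = sum((xi - mean) ** 2 for xi in x) / (n - 1)
s2_b = (sum(xi ** 2 for xi in x) - (sum(x) ** 2) / n) / (n - 1)

assert math.isclose(s2_a, s2_b)
print("variance:", s2_a, "standard deviation:", math.sqrt(s2_a))
```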

Graphic Displays of Basic Statistical
Descriptions of Data
• These include:
– quantile plots
– quantile–quantile plots
– Histograms
– scatter plots

• Such graphs are helpful for the visual inspection of data,


which is useful for data preprocessing.
• The first three of these: quantile plots, quantile–quantile
plots, histograms show univariate distributions (i.e.,
data for one attribute), while scatter plots show bivariate
distributions (i.e., involving two attributes).

Quantile plot
• A quantile plot is a simple and effective
way to have a first look at a univariate data
distribution.
• It displays all of the data for the given
attribute.

Quantile Plot
• Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
• Plots quantile information
– For data x_i sorted in increasing order, f_i = (i - 0.5)/n indicates that approximately 100 f_i % of the data are below or equal to the value x_i

Quantile-Quantile (Q-Q) Plot


• Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
• View: is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile.
• Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.

Gaussian or Normal distribution
• The Gaussian probability distribution function is a kind of pdf defined by:

    g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

  with μ being the mean and σ being the standard deviation.

[Figure: Gaussian distribution with zero mean and unit standard deviation]

Probability distributions

[Figure: histogram of the height of adults in cm, showing the probability of each 10 cm bin from <150 to >210]

Properties of Normal Distribution Curve

• The normal (distribution) curve


– From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of it
– From μ–3σ to μ+3σ: contains about 99.7% of it
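A quick empirical check of these percentages, sketched with numpy random sampling:

```python
# Sketch: verify the 68-95-99.7 rule on a large normal sample.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # mu = 0, sigma = 1

for k in (1, 2, 3):
    frac = np.mean(np.abs(x) <= k)
    print(f"within mu +/- {k} sigma: {frac:.4f}")  # ~0.683, ~0.954, ~0.997
```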


Histograms
• Histograms (or frequency histograms) are at least a
century old and are widely used.
– “Histos” means pole or mast, and “gram” means chart, so a
histogram is a chart of poles.

• Plotting histograms is a graphical method for


summarizing the distribution of a given attribute, X.
– e.g., a price attribute with a value range of $1 to $200 (rounded
up to the nearest dollar) can be partitioned into subranges 1 to
20, 21 to 40, 41 to 60, and so on.
– For each subrange, a bar is drawn with a height that represents
the total count of items observed within the subrange.
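A minimal matplotlib sketch of such a price histogram; the prices are randomly generated stand-ins for real data:

```python
# Histogram of a price attribute over $20-wide subranges (buckets).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
prices = rng.integers(1, 201, size=500)     # unit prices in $1..$200

bins = np.arange(0, 220, 20)                # subranges of width $20
plt.hist(prices, bins=bins, edgecolor="black")
plt.xlabel("Unit price ($)")
plt.ylabel("Count of items sold")
plt.title("Histogram of price")
plt.show()
```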

Histograms
• If X is nominal, such as automobile model or item type,
then a pole or vertical bar is drawn for each known value
of X.
– The height of the bar indicates the frequency (i.e., count) of that X
value.
– The resulting graph is more commonly known as a bar chart.

• If X is numeric, the term histogram is preferred.


• The range of values for X is partitioned into disjoint
consecutive subranges.
– The subranges, referred to as buckets or bins, are disjoint subsets
of the data distribution for X.
– The range of a bucket is known as the width.

Histogram

Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; this is a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable; the categories (bars) must be adjacent

[Figure: histogram with counts (0-40) over salary intervals 10000-90000]

Histograms often tell more than Boxplots


 The two histograms shown on the left may have the same boxplot representation
 The same values for: min,
Q1, median, Q3, max
 But they have rather
different data distributions.

Example of an image and the
associated histogram

Histogram Processing
Figure showing 4
basic image types:
dark,
light,
low contrast,
high contrast,
and their
corresponding
histograms

Image segmentation with histograms

• In figure(a): light objects in dark background


• To extract the objects:
– Select a T that separates the objects from the background
– i.e. any (x,y) for which f(x,y)>T is an object point.

    g(x, y) = \begin{cases} 1 & \text{if } f(x,y) > T \text{ (object)} \\ 0 & \text{if } f(x,y) \le T \text{ (background)} \end{cases}

Scatter Plots and Data Correlation


• A scatter plot is one of the most effective graphical
methods for determining if there appears to be a
relationship, pattern, or trend between two numeric
attributes.

• To construct a scatter plot, each pair of values is treated


as a pair of coordinates in an algebraic sense and plotted
as points in the plane.

• Figure 2.7 shows a scatter plot for the set of data in Table
2.1.

Scatter Plot

2D Scatter Plot

• A scatter plot displays 2-D data points using Cartesian coordinates.
• Through this visualization, we can see that points of types “+” and “x” tend to be co-located.

Scatter plot: train vs. test data

Box plot: train vs. test data

Scatter Plots

Scatter plots can be used to find (a) positive or (b) negative


correlations between attributes.

Positively and Negatively Correlated Data


• The left half fragment is positively correlated
• The right half is negatively correlated

Uncorrelated Data


Correlation Coefficient
• The correlation coefficient (r) ranges from -1 to +1; values close to +1 or -1 indicate a strong relationship.
• The Pearson correlation technique works for linear relationships only, i.e., as one variable gets larger, the other gets larger or smaller in direct proportion.
• It fails for curvilinear relationships.
– Example of curvilinear relationship is age and healthcare; young
children and older people both tend to use much more
healthcare than teenager/young adults.

• Multiple regression can be used to examine such


relationships.
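A small numpy sketch showing that Pearson's r captures a linear relation but can miss a curvilinear one (the age/usage values are invented):

```python
# Pearson's r on a perfectly linear and a U-shaped relationship.
import numpy as np

age = np.array([10, 20, 30, 40, 50, 60, 70], dtype=float)
linear = 2 * age + 3              # perfect linear relation
use = (age - 40) ** 2             # U-shaped: high for young and old

print(np.corrcoef(age, linear)[0, 1])  # 1.0
print(np.corrcoef(age, use)[0, 1])     # 0.0 despite a clear pattern
```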

Data Visualization
• Why data visualization?
– Gain insight into an information space by mapping data onto
graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities, relationships
among data
– Help find interesting regions and suitable parameters for further
quantitative analysis
– Provide a visual proof of computer representations derived
• Categorization of visualization methods:
– Pixel-oriented visualization techniques
– Geometric projection visualization techniques
– Icon-based visualization techniques
– Hierarchical visualization techniques
– Visualizing complex data and relations

Pixel-Oriented Visualization
Techniques
• A simple way to visualize the value of a dimension is to
use a pixel where the color of the pixel reflects the
dimension’s value.
• For a data set of m dimensions, pixel-oriented
techniques create m windows on the screen, one for
each dimension.
• The m dimension values of a record are mapped to m
pixels at the corresponding positions in the windows.
• The colors of the pixels reflect the corresponding values.

e.g., Pixel-Oriented Visualization
• AllElectronics maintains a customer information table,
which consists of four dimensions: income, credit limit,
transaction volume, and age.

• Can we analyze the correlation between income and the


other attributes by visualization?

• Sort all customers in income-ascending order, and use this


order to lay out the customer data in the four visualization
windows, as shown in Figure 2.10.

– The pixel colors are chosen so that the smaller the


value, the lighter the shading.

Pixel-Oriented Visualization

Using pixel-based visualization, we can easily observe the following:
• credit limit increases as income increases;
• customers whose income is in the middle range are more likely to purchase more from AllElectronics;
• there is no clear correlation between income and age.

Figure 2.10: Pixel-oriented visualization of four attributes by sorting all customers in income-ascending order.

Geometric Projection Visualization
• A drawback of pixel-oriented visualization techniques is
that they cannot help us much in understanding the
distribution of data in a multi-dimensional space.
– e.g., they do not show whether there is a dense area in a
multidimensional subspace.

• Geometric projection techniques help users find


interesting projections of multidimensional data sets.
• The central challenge these try to address is how to
visualize a high-dimensional space on a 2-D display.

Geometric projection using 2D Scatter Plot

• A third dimension can be added to a scatter plot by using different colors or shapes to represent different data points.
• The figure shows an example, where X and Y are two spatial attributes and the third dimension is represented by different shapes.

Geometric projection using 3D Scatter Plot

• A 3-D scatter plot uses three axes in a Cartesian coordinate system.
• If it also uses color, it can display up to 4-D data points (Figure 2.14).

Figure 2.14: Visualization of a 3-D data set using a scatter plot. Source: http://upload.wikimedia.org/wikipedia/commons/c/c4/Scatter plot.jpg.

Geometric projection using scatter-plot matrix


• For data sets with more than four dimensions, scatter
plots are usually ineffective.
• The scatter-plot matrix technique is a useful extension
to the scatter plot.
• For an n dimensional data set, a scatter-plot matrix is
an n×n grid of 2-D scatter plots that provides a
visualization of each dimension with every other
dimension.
• Figure 2.15 shows an example, which visualizes the Iris
data set.
– The data set consists of 50 samples from each of three species of Iris flowers (150 samples in total).
– There are five dimensions in the data set: length and width of sepal
and petal, and species.
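A sketch of such a scatter-plot matrix using pandas; it assumes scikit-learn is available to load the Iris data:

```python
# Scatter-plot matrix of the four numeric Iris dimensions,
# colored by species.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
features = iris.frame.iloc[:, :4]          # the four numeric dimensions

pd.plotting.scatter_matrix(features, c=iris.target, figsize=(8, 8),
                           diagonal="hist")
plt.show()
```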

Geometric projection using scatter-plot matrix

Geometric projection using scatter-plot matrix

• The scatter-plot matrix becomes less effective as the


dimensionality increases.

Icon-Based Visualization Techniques
• Icon-based visualization techniques use small icons to
represent multidimensional data values.

• Two popular icon-based techniques: Chernoff faces


and stick figures.

Icon-based technique Chernoff faces


• Chernoff faces were introduced in 1973 by statistician
Herman Chernoff.
• They display multidimensional data of up to 18
variables (or dimensions) as a cartoon human face
(Figure 2.17).
• Chernoff faces help reveal trends in the data.
• Components of the face, such as the eyes, ears,
mouth, and nose, represent values of the dimensions
by their shape, size, placement, and orientation.
– e.g., dimensions can be mapped to the following facial
characteristics: eye size, eye spacing, nose length, nose width,
mouth curvature, mouth width, mouth openness, pupil size,
eyebrow slant, eye eccentricity, and head eccentricity.

Icon-Based Visualization: Chernoff Faces (Figure 2.17)

 Each face represents an n-dimensional data point (n ≤ 18).

Chernoff faces
• Chernoff faces make use of the ability of the
human mind to recognize small differences in
facial characteristics and to assimilate many
facial characteristics at once.
• Viewing large tables of data can be tedious.
• By condensing the data, Chernoff faces make
the data easier for users to digest.

e.g., Chernoff visualization

Drawback of Chernoff faces


• They facilitate visualization of regularities and
irregularities present in the data, although their power in
relating multiple relationships is limited.
• Another limitation is that specific data values are not
shown.
• Furthermore, facial features vary in perceived
importance.
• This means that the similarity of two faces (representing
two multidimensional data points) can vary depending on
the order in which dimensions are assigned to facial
characteristics.
• So, this mapping should be carefully chosen.
• Eye size and eyebrow slant have been found to be
important.

Stick figure visualization
• The stick figure visualization
technique maps multidimensional
data to five-piece stick figures,
– where each figure has four
limbs and a body.

• Two dimensions are mapped to


the display (x and y) axes
• and the remaining dimensions
are mapped to the angle and/or
length of the limbs.

Stick figure visualization


• Figure 2.18 shows census data,
– where age and income are mapped to the display
axes,
– and the remaining dimensions (gender, education,
and so on) are mapped to stick figures.
• If the data items are relatively dense with respect to the
two display dimensions, the resulting visualization
shows texture patterns, reflecting data trends.

Stick figure visualization

Figure 2.18 Census data represented using stick figures. Source: Professor
G. Grinstein, Department of Computer Science, University of Massachusetts
at Lowell.

Hierarchical Visualization Techniques


• The visualization techniques discussed so far focus on
visualizing multiple dimensions simultaneously.

• However, for a large data set of high dimensionality, it


would be difficult to visualize all dimensions at the same
time.

• Hierarchical visualization techniques partition all


dimensions into subsets (i.e., subspaces).

• The subspaces are visualized in a hierarchical manner.

Dendrogram

• A dendrogram is a diagram that shows the hierarchical


relationship between objects. It is most commonly
created as an output from hierarchical clustering.

• The main use of a dendrogram is to work out the best


way to allocate objects to clusters.

• The dendrogram below shows the hierarchical clustering


of six observations shown on the scatterplot to the left.

dendrogram
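A minimal scipy sketch producing such a dendrogram from six invented 2-D observations:

```python
# Single-link agglomerative clustering and the resulting dendrogram.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
                   [5.5, 5.2], [9.0, 1.0], [8.5, 1.3]])
Z = linkage(points, method="single")       # hierarchical clustering
dendrogram(Z, labels=["a", "b", "c", "d", "e", "f"])
plt.ylabel("merge distance")
plt.show()
```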

e.g., Dendrogram

VOSviewer
• VOSviewer is a software tool for constructing and
visualizing bibliometric networks.

• These networks may for instance include journals,


researchers, or individual publications, and they can be
constructed based on citation, bibliographic coupling, co-
citation, or co-authorship relations.

• VOSviewer also offers text mining functionality that can be


used to construct and visualize co-occurrence networks of
important terms extracted from a body of scientific
literature.

VOS viewer

Social network visualization

Visualize and map your social network in Python using NetworkX: https://networkx.org/
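A minimal sketch with NetworkX; the people and friendships are invented for illustration:

```python
# Draw a small friendship graph with a force-directed layout.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Ann", "Cara"), ("Bob", "Cara"),
                  ("Cara", "Dev"), ("Dev", "Eli"), ("Eli", "Fay")])

pos = nx.spring_layout(G, seed=42)          # force-directed node positions
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=900)
plt.show()
```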

Similarity and Dissimilarity
• For algorithms such as clustering, outlier analysis, and
nearest-neighbor classification, we need ways to assess how
alike or unalike objects are in comparison to one another.
• Outlier analysis also employs clustering-based techniques to
identify potential outliers as objects that are highly dissimilar
to others.
• For example,
– a store may want to search for clusters of customer objects, resulting
in groups of customers with similar characteristics (e.g., similar
income, area of residence, and age). Such information can then be
used for marketing.
– Knowledge of object similarities can also be used in nearest-neighbor
classification schemes where a given object (e.g., a patient) is
assigned a class label (relating to, say, a diagnosis) based on its
similarity toward other objects in the model.

Similarity and Dissimilarity


• Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
– Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
• Data matrix
  – n data points with p dimensions
  – Two modes (it has two kinds of entities: rows and columns)

    \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}

• Dissimilarity matrix
  – n data points, but registers only the distance
  – A triangular matrix
  – Single mode (it has one kind of entity: dissimilarities)

    \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}

dissimilarity or “difference”
• d(i, j) is the measured dissimilarity or “difference”
between objects i and j.
– d(i, j) is a non-negative number that is close to 0 when
objects i and j are highly similar or “near” each other,
– it becomes larger the more they differ.
– d(i, i) = 0; that is, the difference between an object and itself is 0.
• Measures of similarity can often be expressed as a function of measures of dissimilarity.
• For example, for nominal data, sim(i, j) = 1 - d(i, j).
Proximity Measure for Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue,
green (generalization of a binary attribute)
• Method 1: Simple matching
  – m: # of matches, p: total # of variables/attributes

    d(i, j) = \frac{p - m}{p}

• Method 2: Use a large number of binary attributes
  – creating a new binary attribute for each of the M nominal states
• Alternatively, similarity can be computed as sim(i, j) = 1 - d(i, j) = m/p

Example 2.17: Dissimilarity between nominal attributes

    d(i, j) = \frac{p - m}{p} = (no. of mismatches) / (total no. of variables)

• Here p = 1, as we have just one nominal attribute, i.e., test-1.
• From this, we see that all objects are dissimilar except objects 1 and 4 (i.e., d(4,1) = 0).
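A small sketch of the simple-matching computation; the test-1 codes below are illustrative stand-ins for the example's table:

```python
# Simple-matching dissimilarity d(i, j) = (p - m) / p for nominal data.
def nominal_dissim(obj_i, obj_j):
    p = len(obj_i)                                  # total # of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))   # # of matches
    return (p - m) / p

test1 = {1: ["code A"], 2: ["code B"], 3: ["code C"], 4: ["code A"]}
print(nominal_dissim(test1[4], test1[1]))  # 0.0: objects 1 and 4 match
print(nominal_dissim(test1[2], test1[1]))  # 1.0: mismatch
```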

Symmetric v/s Asymmetric Binary attributes
• A binary attribute takes only one of two states, 0 and 1, where 0 typically means that the attribute is absent and 1 that it is present.
• Symmetric binary attribute: both states are equally valuable and carry the same weight, so there is no preference on which outcome is coded as 1.
  – e.g., 1 indicates that the patient smokes, while 0 indicates that the patient does not
• Asymmetric binary attribute: the two states are not equally important, and by convention the rarer, more important outcome is coded 1.
  – e.g., to encode the nominal attribute map color with asymmetric binary attributes, a binary attribute can be created for each of the five colors.
  – For an object having the color yellow, the yellow attribute is set to 1, while the remaining four attributes are set to 0.

Proximity Measure for symmetric Binary


Attributes

Contingency Table for Binary Attributes


• q is the number of attributes that equal 1 for both objects i and j
• r is the number of attributes that equal 1 for object i but equal 0 for
object j
• s is the number of attributes that equal 0 for object i but equal 1 for
object j
• t is the number of attributes that equal 0 for both objects i and j.
• The total number of attributes is p, where p = q+r+s+t .

Symmetric binary dissimilarity
• Symmetric dissimilarity between i and j:

    d(i, j) = \frac{r + s}{q + r + s + t}

Asymmetric binary dissimilarity
• Asymmetric dissimilarity between i and j:

    d(i, j) = \frac{r + s}{q + r + s}

• For asymmetric binary dissimilarity, the number of negative matches, t, is considered unimportant and is thus ignored in the distance computation.

Method 2: Nominal attributes with asymmetric binary
attribute values

Dissimilarity Measure for Binary Attributes

• Suppose that a patient record table (Table 2.4) contains the


attributes name, gender, fever, cough, test-1, test-2, test-3,
and test-4, where name is an object identifier, gender is a
symmetric attribute, and the remaining attributes are
asymmetric binary.

Dissimilarity Measure for Binary Attributes

• Let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0.
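A sketch computing q, r, s and the asymmetric dissimilarity for three records in the style of Table 2.4 (gender, being symmetric, is left out):

```python
# Asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s),
# with 1 = Y/P and 0 = N.
def asym_binary_dissim(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))  # 1-0 mismatches
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))  # 0-1 mismatches
    # t (0-0 matches) is ignored for asymmetric attributes
    return (r + s) / (q + r + s)

#       fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```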

Jaccard Similarity Coefficient for Asymmetric Binary Attributes

• Alternatively, the asymmetric binary similarity between objects i and j can be computed as:

    sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)

• The coefficient sim(i, j) is called the Jaccard coefficient.
Standardizing Numeric Data
• Normalization is particularly useful for classification
algorithms involving distance measurements such as
nearest-neighbor classification and clustering.

• Z-score:

    z = \frac{x - \mu}{\sigma}

  – x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  – measures the distance between the raw score and the population mean in units of the standard deviation
  – negative when the raw score is below the mean, positive when above
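A small sketch of z-score standardization on invented raw scores:

```python
# Standardize each value to its distance from the mean in sigma units.
import statistics

x = [35, 40, 45, 50, 55, 60, 65]
mu = statistics.mean(x)        # population mean (for illustration)
sigma = statistics.pstdev(x)   # population standard deviation

z = [(xi - mu) / sigma for xi in x]
print([round(zi, 2) for zi in z])  # [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
```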

Example: Data Matrix and Dissimilarity Matrix

Data matrix:

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity matrix (with Euclidean distance):

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1     5.39  0

[Figure: the four points plotted in the 2-D attribute space]

Distance on Numeric Data: Minkowski Distance

• Minkowski distance: a popular distance measure

    d(i, j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h}

  where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
  – d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  – d(i, j) = d(j, i) (symmetry)
  – d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance

• h = 1: Manhattan (city block, L1 norm) distance
  – E.g., the Hamming distance: the number of bits that are different between two binary vectors

    d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|

• h = 2: Euclidean (L2 norm) distance

    d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}

• h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  – This is the maximum difference between any component (attribute) of the vectors:

    d(i, j) = \max_{f} |x_{if} - x_{jf}|

Example: Minkowski Distance, Dissimilarity Matrices

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Manhattan (L1):
L1    x1   x2   x3   x4
x1    0
x2    5    0
x3    3    6    0
x4    6    1    7    0

Euclidean (L2):
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1     5.39  0

Supremum (L∞):
L∞    x1   x2   x3   x4
x1    0
x2    3    0
x3    2    5    0
x4    3    1    5    0
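A sketch reproducing these three matrices with scipy's distance functions:

```python
# pdist computes pairwise distances; squareform lays them out as a matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])  # x1..x4

for name, metric in [("Manhattan (L1)", "cityblock"),
                     ("Euclidean (L2)", "euclidean"),
                     ("Supremum (Lmax)", "chebyshev")]:
    print(name)
    print(np.round(squareform(pdist(X, metric=metric)), 2))
```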

Summary: Proximity Measure for Binary Attributes

• A contingency table for binary data counts, for objects i and j, the attribute combinations q (1,1), r (1,0), s (0,1), and t (0,0).
• Distance measure for symmetric binary variables:

    d(i, j) = \frac{r + s}{q + r + s + t}

• Distance measure for asymmetric binary variables:

    d(i, j) = \frac{r + s}{q + r + s}

• Jaccard coefficient (similarity measure for asymmetric binary variables):

    sim_{Jaccard}(i, j) = \frac{q}{q + r + s}

 Note: the Jaccard coefficient is the same as “coherence”.
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
  – replace x_{if} by its rank r_{if} ∈ {1, ..., M_f}
  – map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    z_{if} = \frac{r_{if} - 1}{M_f - 1}

  – compute the dissimilarity using methods for interval-scaled variables

Attributes of Mixed Type

• A database may contain all attribute types
  – nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects:

    d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

  – f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
  – f is numeric: use the normalized distance
  – f is ordinal:
    • compute the ranks r_{if} and z_{if} = \frac{r_{if} - 1}{M_f - 1}
    • treat z_{if} as interval-scaled

Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
• In general it lies between -1 and +1; for nonnegative term-frequency vectors it lies in [0, 1].
• Other vector objects: gene features in micro-arrays, ...
• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

    cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

  where · indicates the vector dot product and ||d|| is the length of vector d.

Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d.

• Ex: find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 3*3 + 2*2 + 2*2)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 2*2 + 1*1 + 1*1 + 1*1 + 1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) = 0.94
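The same computation as a small Python sketch:

```python
# Cosine similarity between the two term-frequency vectors above.
import math

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(d1, d2))   # 25
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42) ~= 6.481
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(17) ~= 4.123
print(round(dot / (norm1 * norm2), 2))     # 0.94
```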

Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image.
• Gain insight into the data by:
– Basic statistical data description: central tendency, dispersion,
graphical displays
– Data visualization: map data onto graphical primitives
– Measure data similarity
• Above steps are the beginning of data preprocessing.
• Many methods have been developed but still an active area of
research.

