Visualization
Source: Chapter 2, Data Mining: Concepts and Techniques by Han and Kamber
CSE25.8: Elective-I
MCA II Sem
March-June 2022
Content
• Data objects and Attributes
• Measures of Central tendency
• Dispersion of data
• Data Visualization
• Data proximity measures
Data objects
• A data object represents an entity:
– In a sales database, the objects may be customers, store items,
and sales;
– in a medical database, the objects may be patients;
– in a university database, the objects may be students,
professors, and courses.
Attributes/Feature vector
• A set of attributes used to describe a given object is
called an attribute vector (or feature vector).
• These terms are synonymous in Data science: attribute,
dimension, feature, predictor, and variable.
• The term dimension is commonly used in data
warehousing.
• Machine learning literature tends to use the term feature,
while statisticians prefer the term variable.
• Data mining and database professionals commonly use
the term attribute.
Data
• The distribution of data involving one
attribute/variable is called univariate.
• A bivariate distribution involves two
attributes.
• A multivariate distribution involves three or more attributes, and so on.
Attribute Types
The type of an attribute is determined by the set of possible values the
attribute can have:
• Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV
positive)
• Ordinal
– Values have a meaningful order (ranking) but magnitude
between successive values is not known.
– Size = {small, medium, large}, grades, army rankings
Nominal attribute
• Nominal/categorical
– e.g., customer_ID, name, address, hair color, marital status, etc.
– possible values for hair color are black, brown, blond, red, auburn,
gray, and white.
– marital status can take values single, married, divorced, and
widowed, etc.
Binary Attribute
• A binary attribute is a nominal attribute with only two
categories or states: 0 or 1, where 0 typically means that
the attribute is absent, and 1 means that it is present.
• Binary attributes are referred to as Boolean if the two
states correspond to true and false.
Ordinal attributes
• Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured
objectively; thus ordinal attributes are often used in
surveys for ratings.
– In one survey, participants were asked to rate how satisfied they
were as customers.
– Customer satisfaction had the following ordinal categories: 0:
very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3:
satisfied, and 4: very satisfied.
Discrete vs. Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a
collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and
represented using a finite number of digits
– Continuous attributes are typically represented as
floating-point variables
Basic Statistical Descriptions of Data
• Basic statistical descriptions can be used to identify
properties of the data and highlight which data values
should be treated as noise or outliers.
• Measures of central tendency
– given an attribute, where do most of its values fall e.g., mean,
median, mode, and midrange
• Dispersion of the data
– how are the data spread out e.g., range, quartiles, interquartile
range; the five-number summary and boxplots; variance and
standard deviation of the data
• Graphic displays of basic statistical descriptions to
visually inspect our data
– bar charts, pie charts, and line graphs, quantile plots, quantile–
quantile plots, histograms, and scatter plots
Measures of Central Tendency
• Mean (algebraic measure), sample vs. population:
  – sample mean: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
  – population mean: \mu = \frac{\sum x}{N}
  – Note: n is the sample size and N is the population size.
• Weighted arithmetic mean: \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
  – Trimmed mean: the mean obtained after chopping extreme values
• Median: the middle value of the data when sorted
  – Empirical formula for unimodal numeric data that are moderately skewed:
    \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})
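As a quick sketch, these central-tendency measures can be computed with Python's standard library (the data values below are made up for illustration):

```python
from statistics import mean, median, mode

# hypothetical, unimodal sample data
data = [1, 2, 2, 3, 4]

avg = mean(data)                        # arithmetic mean
mid = median(data)                      # middle value of the sorted data
most = mode(data)                       # most frequent value
midrange = (min(data) + max(data)) / 2  # average of the largest and smallest values
```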
Trimmed Mean
• It is the mean obtained after chopping off values at
the high and low extremes.
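A minimal sketch of a trimmed mean, assuming the same fraction of values is chopped off at each extreme:

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean computed after chopping off trim_fraction of values at each end."""
    vals = sorted(values)
    k = int(len(vals) * trim_fraction)   # number of values to drop at each extreme
    kept = vals[k:len(vals) - k] if k else vals
    return sum(kept) / len(kept)

# dropping one value at each end removes the extreme value 100
t = trimmed_mean([1, 2, 3, 4, 100], trim_fraction=0.2)
```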
Interval Median
• The median is expensive to compute when we have a large number of
observations.
• For numeric attributes, however, we can easily approximate the value.
– E.g., employees may be grouped according to their annual salary in
intervals such as $10–20,000, $20–30,000, and so on. Let the interval that
contains the median frequency be the median interval.
• We can approximate the median of the entire data set (e.g., the median
salary) by interpolation using the formula:
  median \approx L_1 + \left( \frac{n/2 - (\sum freq)_l}{freq_{median}} \right) \times width
• where L1 is the lower boundary of the median interval
• n is the number of values in the entire data set
• (∑freq)l is the sum of the frequencies of all of the intervals that are lower
than the median interval
• freqmedian is the frequency of the median interval
• width is the width of the median interval.
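The interpolation formula above can be sketched directly; the salary intervals and frequencies below are hypothetical:

```python
def interval_median(intervals):
    """intervals: list of (lower_bound, width, frequency), sorted ascending.
    Approximates the median by interpolation within the median interval."""
    n = sum(freq for _, _, freq in intervals)
    cum = 0
    for lower, width, freq in intervals:
        if cum + freq >= n / 2:        # this interval contains the median
            return lower + ((n / 2 - cum) / freq) * width
        cum += freq

# hypothetical grouped annual salaries, in dollars
salary = [(10000, 10000, 200), (20000, 10000, 450),
          (30000, 10000, 300), (40000, 10000, 50)]
approx_median = interval_median(salary)
```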
Range and Quartiles
• The range of the set is the difference between the largest
(max()) and smallest (min()) values.
• Quantiles are points taken at regular intervals of a data
distribution, dividing it into essentially equal size
consecutive sets.
– The 2-quantile is the data point dividing the lower and upper
halves of the data distribution which corresponds to the median.
– The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-
fourth of the data distribution. They are more commonly
referred to as quartiles.
Interquartile Range
• The quartiles give an indication of a distribution’s center,
spread, and shape.
– The first quartile, denoted by Q1, is the 25th percentile. It cuts off
the lowest 25% of the data.
– The third quartile, denoted by Q3, is the 75th percentile—it cuts
off the lowest 75% (or highest 25%) of the data.
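These quantiles are easy to check numerically; a sketch using NumPy's default linear interpolation (the data values are invented):

```python
import numpy as np

data = np.array([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

# 25th, 50th, and 75th percentiles: Q1, median (Q2), and Q3
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1   # interquartile range: spread of the middle 50% of the data
```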
Box Plot
• A boxplot is a standardized way of displaying the distribution of data based on a five-number summary:
– Min
– first quartile (Q1)
– Median
– third quartile (Q3)
– Max
Boxplot Analysis
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended
to Minimum and Maximum
– Outliers: points beyond a specified outlier
threshold, plotted individually
Variance and Standard deviation
• Variance and standard deviation are measures of
data dispersion.
• SD measures spread about the mean and should be
considered only when the mean is chosen as the
measure of center.
• SD indicates how spread out a data distribution is: a low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.
• Variance = (S.D.)²
  s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]
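A short sketch of the sample variance and standard deviation with NumPy (`ddof=1` gives the n−1 denominator used in the formula above; the data are invented):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_variance = np.var(data, ddof=1)   # divides by n - 1
sample_sd = np.std(data, ddof=1)         # square root of the sample variance
```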
Graphic Displays of Basic Statistical
Descriptions of Data
• These include:
– quantile plots
– quantile–quantile plots
– Histograms
– scatter plots
Quantile plot
• A quantile plot is a simple and effective
way to have a first look at a univariate data
distribution.
• It displays all of the data for the given
attribute.
Quantile Plot
• Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
• Plots quantile information
– For a data xi data sorted in increasing order, fi
indicates that approximately 100 fi% of the data are
below or equal to the value xi
Gaussian or Normal distribution
• The Gaussian probability
distribution function is a kind of
pdf defined by:
  g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

with μ being the mean and σ being the standard deviation.
[Figure: Gaussian distribution with zero mean and unit standard deviation.]
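The density above translates directly into a small function; this is a sketch using only the standard library:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """g(x; mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
```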
Probability distributions
[Figure: bar chart of the probability distribution of adult heights in cm, with probabilities from .01 to .28 over 10-cm bins ranging from below 150 to above 210.]
Properties of Normal Distribution Curve
Histograms
• Histograms (or frequency histograms) are at least a
century old and are widely used.
– “Histos” means pole or mast, and “gram” means chart, so a
histogram is a chart of poles.
Histograms
• If X is nominal, such as automobile model or item type,
then a pole or vertical bar is drawn for each known value
of X.
– The height of the bar indicates the frequency (i.e., count) of that X
value.
– The resulting graph is more commonly known as a bar chart.
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable; the categories (bars) must be adjacent
[Figure: example histogram with counts from 0 to 40 over intervals 10000 to 90000.]
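Binning values into adjacent, non-overlapping intervals can be sketched with `numpy.histogram` (the values and bin edges here are invented; each bin is half-open except the last, which includes its upper edge):

```python
import numpy as np

values = np.array([1, 7, 12, 15, 22, 23, 25, 28, 31, 38])

# counts[k] is the number of values in [edges[k], edges[k+1])
counts, edges = np.histogram(values, bins=[0, 10, 20, 30, 40])
```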
Example of an image and the
associated histogram
Histogram Processing
[Figure: four basic image types (dark, light, low contrast, high contrast) and their corresponding histograms.]
Image segmentation with histograms
  g(x, y) = 1 if f(x, y) > T (object); g(x, y) = 0 if f(x, y) ≤ T (background)
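Thresholding as described can be sketched with NumPy; the image values and the threshold T are hypothetical, and the convention here assumes object pixels are the bright ones:

```python
import numpy as np

# hypothetical 3x3 grayscale image f(x, y) with values in 0..255
f = np.array([[ 12,  40, 200],
              [180,  30, 220],
              [ 25, 210,  15]])

T = 128                       # hypothetical global threshold
g = (f > T).astype(np.uint8)  # 1 = object pixel, 0 = background pixel
```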
• Figure 2.7 shows a scatter plot for the set of data in Table
2.1.
Scatter Plot
2D Scatter Plot
through this
visualization, we can
see that points of
types “+” and “x”
tend to be co-
located.
Scatter plot Train v/s Test data
Scatter Plots
Uncorrelated Data
Correlation Coefficient
• The correlation coefficient (r) ranges from −1 to +1; values near +1 or −1 indicate a strong relationship.
• The Pearson correlation technique works with linear relationships only, e.g., as one variable gets larger, the other variable gets larger/smaller in direct proportion.
• It fails for curvilinear relationships.
  – An example of a curvilinear relationship is age vs. healthcare use; young children and older people both tend to use much more healthcare than teenagers/young adults.
Data Visualization
• Why data visualization?
– Gain insight into an information space by mapping data onto
graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities, relationships
among data
– Help find interesting regions and suitable parameters for further
quantitative analysis
– Provide a visual proof of computer representations derived
• Categorization of visualization methods:
– Pixel-oriented visualization techniques
– Geometric projection visualization techniques
– Icon-based visualization techniques
– Hierarchical visualization techniques
– Visualizing complex data and relations
Pixel-Oriented Visualization
Techniques
• A simple way to visualize the value of a dimension is to
use a pixel where the color of the pixel reflects the
dimension’s value.
• For a data set of m dimensions, pixel-oriented
techniques create m windows on the screen, one for
each dimension.
• The m dimension values of a record are mapped to m
pixels at the corresponding positions in the windows.
• The colors of the pixels reflect the corresponding values.
e.g., Pixel-Oriented Visualization
• AllElectronics maintains a customer information table,
which consists of four dimensions: income, credit limit,
transaction volume, and age.
Pixel-Oriented Visualization
Geometric Projection Visualization
• A drawback of pixel-oriented visualization techniques is
that they cannot help us much in understanding the
distribution of data in a multi-dimensional space.
– e.g., they do not show whether there is a dense area in a
multidimensional subspace.
Geometric projection using 2D Scatter Plot
• A third dimension in a scatter plot can be added by using different colors or shapes to represent different data points.
• The figure shows an example, where X and Y are two spatial attributes and the third dimension is represented by different shapes.
Geometric projection using 3D Scatter Plot
• A 3-D scatter plot uses three axes in a Cartesian coordinate system. If it also uses color, it can display up to 4-D data points (Figure 2.14).
Figure 2.14 Visualization of a 3-D data set using a scatter plot. Source:
http://upload.wikimedia.org/wikipedia/commons/c/c4/Scatter plot.jpg.
In a parallel coordinates plot, a data record is represented by a polygonal line that intersects each axis at the point corresponding to the associated dimension value.
Geometric projection using scatter-plot matrix
Icon-Based Visualization Techniques
• Icon-based visualization techniques use small icons to
represent multidimensional data values.
Icon-Based Visualization: Chernoff faces
Figure 2.17 Chernoff faces.
• Chernoff faces make use of the ability of the
human mind to recognize small differences in
facial characteristics and to assimilate many
facial characteristics at once.
• Viewing large tables of data can be tedious.
• By condensing the data, Chernoff faces make
the data easier for users to digest.
e.g., Chernoff visualization
Stick figure visualization
• The stick figure visualization
technique maps multidimensional
data to five-piece stick figures,
– where each figure has four
limbs and a body.
Stick figure visualization
Figure 2.18 Census data represented using stick figures. Source: Professor
G. Grinstein, Department of Computer Science, University of Massachusetts
at Lowell.
Dendrogram
e.g., Dendrogram
VOSviewer
• VOSviewer is a software tool for constructing and
visualizing bibliometric networks.
VOS viewer
Similarity and Dissimilarity
• For algorithms such as clustering, outlier analysis, and
nearest-neighbor classification, we need ways to assess how
alike or unalike objects are in comparison to one another.
• Outlier analysis also employs clustering-based techniques to
identify potential outliers as objects that are highly dissimilar
to others.
• For example,
– a store may want to search for clusters of customer objects, resulting
in groups of customers with similar characteristics (e.g., similar
income, area of residence, and age). Such information can then be
used for marketing.
– Knowledge of object similarities can also be used in nearest-neighbor
classification schemes where a given object (e.g., a patient) is
assigned a class label (relating to, say, a diagnosis) based on its
similarity toward other objects in the model.
Data Matrix and Dissimilarity Matrix
• Data matrix
  – n data points with p dimensions
  – Two modes (it has two kinds of entities: rows and columns)

      [ x_11  ...  x_1f  ...  x_1p ]
      [  ...  ...   ...  ...   ... ]
      [ x_i1  ...  x_if  ...  x_ip ]
      [  ...  ...   ...  ...   ... ]
      [ x_n1  ...  x_nf  ...  x_np ]

• Dissimilarity matrix
  – n data points, but registers only the distances d(i, j)
  – A triangular matrix
  – Single mode (it has one kind of entity: dissimilarity)

      [   0                            ]
      [ d(2,1)    0                    ]
      [ d(3,1)  d(3,2)    0            ]
      [   :        :      :            ]
      [ d(n,1)  d(n,2)   ...  ...   0  ]
• d(i, j) is the measured dissimilarity or “difference”
between objects i and j.
– d(i, j) is a non-negative number that is close to 0 when
objects i and j are highly similar or “near” each other,
– it becomes larger the more they differ.
– d(i, i) = 0; that is, the difference between an object and
itself is 0.
• Measures of similarity can often be expressed as a
function of measures of dissimilarity.
• For example, for nominal data, sim(i, j) = 1 − d(i, j).
Proximity Measure for Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue,
green (generalization of a binary attribute)
• Method 1: Simple matching
– m: # of matches, p: total # of variables/attributes
  d(i, j) = \frac{p - m}{p}
• Method 2: Use a large number of binary attributes
– creating a new binary attribute for each of the M
nominal states
Alternatively, similarity can be computed as sim(i, j) = \frac{m}{p} = 1 - d(i, j)
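Method 1 (simple matching) can be sketched in a few lines; the attribute vectors below are hypothetical:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p, where p is the number of
    attributes and m is the number of attributes that match."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p

# hypothetical objects with 3 nominal attributes; one of three differs
d = nominal_dissimilarity(["red", "single", "B"], ["red", "married", "B"])
similarity = 1 - d
```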
Symmetric v/s Asymmetric Binary attributes
• A binary attribute has only one of two states: 0 and 1, where 0 means that the attribute is absent and 1 means that it is present
  – e.g., 1 indicates that the patient smokes, while 0 indicates that the patient does not
• For a symmetric binary attribute, both states are equally important (e.g., gender); for an asymmetric one, they are not (e.g., a positive medical test result)
Symmetric binary dissimilarity
• Using the contingency counts q (attributes equal to 1 for both objects), r (1 for i, 0 for j), s (0 for i, 1 for j), and t (0 for both), the symmetric dissimilarity between i and j is d(i, j) = (r + s) / (q + r + s + t)
Method 2: Nominal attributes with asymmetric binary
attribute values
Dissimilarity Measure for Binary Attributes
• Using the contingency counts q (attributes equal to 1 for both objects), r (1 for i, 0 for j), s (0 for i, 1 for j), and t (0 for both):
  – symmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s + t)
  – asymmetric binary dissimilarity (negative matches t ignored): d(i, j) = (r + s) / (q + r + s)
  – Jaccard coefficient (asymmetric binary similarity): sim(i, j) = q / (q + r + s) = 1 − d(i, j)
Standardizing Numeric Data
• Normalization is particularly useful for classification
algorithms involving distance measurements such as
nearest-neighbor classification and clustering.
• Z-score: z = \frac{x - \mu}{\sigma}
– X: raw score to be standardized, μ: mean of the
population, σ: standard deviation
– the distance between the raw score and the population
mean in units of the standard deviation
– negative when the raw score is below the mean, “+”
when above
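A minimal sketch of z-score standardization, using the population mean and standard deviation as in the formula above (the data values are invented):

```python
import statistics

def z_scores(values):
    """Standardize each value: z = (x - mu) / sigma."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)   # population standard deviation
    return [(x - mu) / sigma for x in values]

zs = z_scores([2, 4, 6, 8])   # symmetric data, so scores pair up around 0
```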
Example:
Data Matrix and Dissimilarity Matrix

Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity Matrix (with Euclidean Distance)
      x1     x2    x3    x4
x1    0
x2    3.61   0
x3    2.24   5.1   0
x4    4.24   1     5.39  0

[Figure: scatter plot of the four points in the attribute1-attribute2 plane.]
Distance on Numeric Data: Minkowski Distance
  d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}
where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two
p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
– d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
– d(i, j) = d(j, i) (Symmetry)
– d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
Example: Minkowski Distance
Dissimilarity Matrices

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Manhattan (L1)
L1   x1  x2  x3  x4
x1   0
x2   5   0
x3   3   6   0
x4   6   1   7   0

Euclidean (L2)
L2   x1    x2   x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1  0
x4   4.24  1    5.39  0

Supremum (L∞)
L∞   x1  x2  x3  x4
x1   0
x2   3   0
x3   2   5   0
x4   3   1   5   0
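The matrix entries above can be reproduced with a small Minkowski-distance sketch; h=1 gives Manhattan, h=2 Euclidean, and h=inf the supremum distance:

```python
def minkowski(x, y, h):
    """L_h norm distance between two equal-length numeric vectors."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == float("inf"):        # supremum (L-infinity) distance
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)

x1, x2 = (1, 2), (3, 5)   # two of the example points
```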
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
– replace x_{if} by its rank r_{if} ∈ {1, ..., M_f}
– map the range of each variable onto [0, 1] by replacing the
  i-th object in the f-th variable by
  z_{if} = \frac{r_{if} - 1}{M_f - 1}
– compute the dissimilarity using methods for interval-
  scaled variables
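The rank-to-[0, 1] mapping above can be sketched as follows, using the Size example from earlier:

```python
def ordinal_to_interval(value, ordered_states):
    """Map an ordinal value to [0, 1]: z = (r - 1) / (M - 1),
    where r is the 1-based rank and M is the number of states."""
    r = ordered_states.index(value) + 1   # rank of the value
    M = len(ordered_states)
    return (r - 1) / (M - 1)

sizes = ["small", "medium", "large"]   # ordered ordinal states
z = ordinal_to_interval("medium", sizes)
```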
Attributes of Mixed Type
• A database may contain attributes of all the types above; the types can be combined by computing d(i, j) as a weighted average of the dissimilarities contributed by the individual attributes.
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
• Cosine similarity lies between −1 and +1.
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||),
  where · indicates the vector dot product and ||d|| is the length of vector d
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
  d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
  ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
  ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 = 4.123
  cos(d1, d2) = 25 / (6.481 × 4.123) = 0.94
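The worked example above can be checked with a small cosine-similarity sketch:

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# the two term-frequency vectors from the example
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
sim = cosine_similarity(d1, d2)
```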
Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image.
• Gain insight into the data by:
– Basic statistical data description: central tendency, dispersion,
graphical displays
– Data visualization: map data onto graphical primitives
– Measure data similarity
• Above steps are the beginning of data preprocessing.
• Many methods have been developed but still an active area of
research.