Professional Documents
Culture Documents
Exploring Data
Created by :Tan, Steinbach, Kumar
Modified by : Thanrat Sintanakul
– Virginica
– Versicolour
• Four (non-class) attributes
– Sepal width and length
– Petal width and length
Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 6
Summary Statistics
• Summary statistics are numbers that
summarize properties of the data
– Summarized properties include frequency,
location and spread
• Examples: location - mean
spread - standard deviation
– Most summary statistics can be calculated
in a single pass through the data
displaying the
distribution of the
values of a single 75th percentile
25th percentile
– Following figure shows
the basic part of a
10th percentile
box plot
Celsius
• Data Matrix :
– is a rectangular array of values
– Can be visualized as an image by associating
each entry of the data matrix with a pixel
in the image
– The brightness or color of the pixel is
determined by the value of the
corresponding entry of the matrix
• Matrix plots
– Can plot the data matrix
– This can be useful when objects are
sorted according to class
– Typically, the attributes are normalized
to prevent one attribute from dominating
the plot
– Plots of similarity or distance matrices
can also be useful for visualizing the
relationships between objects
standard
deviation
Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 56
Visualization Techniques: Matrix Plots
Correlation
• Parallel Coordinates
– Used to plot the attribute values of high-
dimensional data
– Instead of using perpendicular axes, use a set of
parallel axes
– The attribute values of each object are plotted
as a point on each corresponding coordinate axis
and the points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of
objects group together, at least for some
attributes
– Ordering of attributes is important in seeing such
groupings
Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 59
Parallel Coordinates Plots for Iris Data
• Example: consider a
parallel coordinates
plot of the 4
numerical attributes
of the Iris Data Set
• The plot shows that
the classes are
reasonably well
separated for the
pedal width & length,
but less well
separated for sepal
width & length
• Another parallel
coordinates plot ,
with different
ordering of the
axes, of the 4
numerical
attributes of the
Iris Data Set
•
• Star Plots
– Similar approach to parallel coordinates,
but axes radiate from a central point
– The line connecting the values of an
object is a polygon
– The size & shape of
the polygon gives a
visual description of
the attribute values
of the object
Setosa
Versicolour
Virginica
• Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute
with a characteristic of a face
– The values of each attribute determine
the appearance of the corresponding facial
characteristic
– Each object becomes a
separate face
– Relies on human’s ability to
distinguish faces
Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 64
Visualization Techniques : Chernoff Faces
• Chernoff Faces(cont.)
Data Feature Facial Feature
Sepal Length Size of Face
Sepal Width Forehead/Jaw relative arc
length
Petal Length Shape of Forehead
Petal Width Shape of Jaw
• Other features of the face, e.g. width
between the eyes & length of the
mouth, are given default values
Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 65
Chernoff Faces for Iris Data
Setosa
Versicolour
Virginica
Setosa Versicolour
Virginica
Reference :
• Example of OLAP Operations -
pangsida.sakaeo.buu.ac.th/~52410017
/.../Ch06OLAP_Cubes.pptx