Chapter 3 Data Exploration

Data Mining:
Exploring Data
Created by :Tan, Steinbach, Kumar
Modified by : Thanrat Sintanakul
Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 1

What is Data Exploration?
A preliminary exploration of the data to better
understand its characteristics.
• Key motivations of data exploration include
– Helping to select the right tool for preprocessing
or analysis
– Making use of humans’ abilities to recognize
patterns
• People can recognize patterns not
captured by data analysis tools

• 3 Major Topics of Exploring Data;
– Summary Statistics : X, S.D.
– Visualization Techniques : Histograms,
Scatter Plots
– On-Line Analytical Processing(OLAP) : a
set of techniques for exploring
multidimensional arrays of values

• OLAP-related analysis fn focus on various
ways to create summary data tables from a
multidimensional data array, by aggregating
data either across various dimensions or
across various attribute values
• For instance, considering sales info.
reported according to product, location, &
date
– OLAP techniques can be used to create a
summary that describes that sales activity at a
particular location by month & product category

Techniques Used In Data Exploration
• In EDA(Exploratory Data Analysis), as

originally defined by John Tukey
– The focus was on visualization
– Clustering and anomaly detection were
viewed as exploratory techniques
• In data mining, clustering and anomaly

detection are major areas of interest,
and not thought of as just
exploratory

Iris Sample Data Set
• Many of the exploratory data techniques
are illustrated with the Iris Plant data set.
– 5 Attributes;
• Three flower types (classes):
– Setosa
– Virginica
– Versicolour
• Four (non-class) attributes
– Sepal width and length
– Petal width and length
Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
Summary Statistics
• Summary statistics are numbers that
summarize properties of the data
– Summarized properties include frequency,
location and spread
• Examples: location - mean
spread - standard deviation
– Most summary statistics can be calculated
in a single pass through the data

Summary Statistics
• For example;
– The average household income
– The fraction(percentage) of college
students who complete an undergraduate
degree in 4 years

Frequency and Mode
• The frequency of an attribute value is
the percentage of time the value
occurs in the data set
– For example, given the attribute ‘gender’
and a representative population of people,
the gender ‘female’ occurs about 50% of
the time.
• The mode of an attribute is the most
frequent attribute value

Frequency and Mode
No. Gender Department
1 M Electronics
2 F Information Technology
4 F Business Computer
5 M Electronics
6 M Business Computer
10
Data Mining F Thanrat Sintanakul Business Computer
Department of Computer Education KMUTNB 10
Frequency and Mode
• The notions of frequency and mode are
typically used with Categorical Data
• For Categorical Attributes
Frequency & Mode can be interesting &
useful
• For Continuous Data Mode is often
not useful because a single value may
not occur > 1 time

Percentiles
• For continuous data, the notion of a
percentile is more useful.
• Given an ordinal or continuous attribute
x and a number p between 0 and 100,
the pth percentile is a value xp of x such
xp
that p% of the observed values of x

are less than xp .
• For instance, the 50th percentile is
the value x50% such that 50% of all
values of x are less than x50% .
Percentiles
• See example in the sheet

Measures of Location: Mean and Median
• The mean is the most common measure of

the location of a set of points.
• However, the mean is very sensitive to
outliers.
• Thus, the median or a trimmed mean is also
commonly used.

• Mean & Median are 2 of the most
widely used for Continuous Data
• Example: consider the height of 15
students, as follows;
• {130,121,123,124,123,127,131,125,
122,120,127,128,125,124,124}
• Mean = Σxi/x = 1874/15 = 124.93
• Median;
– 120,121,122,123,123,124,124,124.125,
125,127,127,128,130,131
• Mean can be interpreted as the middle

of a set of values, in case of;
– The values are distributed in a symmetric
manner (Normal Curve)
– No outliers in such data set
• In opposite, Median can be better
used, in case of;
– The distribution of values is skewed
– Data with outliers
• To overcome problems with the

traditional definition of a mean, the
notion of a Trimmed Mean is
sometimes used
• A percentage p between 0 & 100 is
specified, the top & bottom (p/2)% of
the data set is thrown out

• Example: consider the set of values

{1,2,3,4,5,90}
ไมแทนคากลาง
• Traditional Mean = 17.5 ที่เปนจริง
• Traditional Median = 3.5
• Trimmed Mean with p=40% is
– ตัดคานอยที่สุด 20% (=1.2 ค่ า 1 ค่ า) และคามาก
ที่สุด 20% (=1.2 ค่ า 1 ค่ า) ออก จะได {2,3,4,5}
– แลวคิดคา Mean ตามปกติ จะได = 3.5

Measures of Spread: Range and Variance
• Commonly used for Continuous Data

• Measure the dispersion or spread of a
set of values, by indicating if the
attribute values are widely spread out
or if they are relatively concentrated
around a single point, e.g. the Mean
• Range is the difference between the
Max. and Min. value;
Range(x) = Max(x) – Min(x) = xm – x1
• Range can be misleading if most of

the values are concentrated in a
narrow band of values, but there are
also a relatively small no. of more
extreme values(outliers)
• The Variance or Standard Deviation is
the most common measure of the
spread of a set of points.

• The Standard Deviation(S.D.) is the

square root of the Variance (is written
as Sx)
• The Mean can be distorted by
Outliers, and since the Variance is
computed using the Mean, it’s also
sensitive to Outliers

• So that other measures are often

used;
– AAD : Absolute Average Deviation
– MAD : Median Absolute Deviation
– IQR : Interquartile Range

Visualization
• Visualization is the conversion of data
into a visual or tabular format so that
the characteristics of the data and the
relationships among data items or
attributes can be analyzed or reported.
• Sometimes the use of visualization
techniques in data mining is referred to
as Visual Data Mining

Visualization
• Visualization of data is one of the
most powerful and appealing techniques
for data exploration.
– Humans have a well developed ability to
analyze large amounts of information that
is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns

Example: Sea Surface Temperature
• The following shows the Sea Surface
Temperature (SST) for July 1982
– Tens of thousands of data points are
summarized in a single figure

Representation
• Is the mapping of information to a visual
format
• Data objects, their attributes, and the
relationships among data objects are
translated into graphical elements such as
points, lines, shapes, and colors.
• Example:
– Objects are often represented as points
– Their attribute values can be represented as the
position of the points or the characteristics of
the points, e.g., color, size, and shape
– If position is used, then the relationships of
points, i.e., whether they form groups or a point
is an outlier, is easily perceived.
Arrangement
• Is the placement of visual elements
within a display
• Can make a large difference in how
easy it is to understand the data
• Example:

Arrangement

Selection
• Is the elimination or the de-emphasis
of certain objects and attributes
• Selection may involve the choosing a
subset of attributes
– Dimensionality reduction is often used to
reduce the number of dimensions to two or
three
– Alternatively, pairs of attributes can be
considered
Selection
• Selection may also involve choosing a
subset of objects
– A region of the screen can only show so
many points
– Can sample, but want to preserve points
in sparse areas

Visualization Techniques
• 3 categories of visualization
techniques;
– Visualization of data with small no. of
attributes
– Visualization of data with spatial and/or
temporal attributes
– Visualization of data with many
attributes

• Visualization of data with small no.
of attributes;
• Stem & Leaf Plots
• Histograms
• Box Plots
• Pie Chart
• Scatter Plots

Visualization Techniques: Stem & Leaf Plots
• Can be used to provide insight into

the distribution of one-dimensional
integer or continuous data
• example: consider the set of integers
of the sepal length (from the Iris
Data Set) in cm.

Visualization Techniques: Stem & Leaf Plots

Visualization Techniques: Histograms
• Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the
number of objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)

Two-Dimensional Histograms
• Show the joint distribution of the
values of two attributes
• Each attribute is divided into intervals
and the 2 sets of intervals define
2-dimensional rectangles of values
• Can be used to discovered interesting
facts about how the values of 2
attributes co-occur

• Example: petal width and petal length
– What does this tell us?

• Example(cont.):
– Shows a 2-dimensional histogram of petal
length & petal width (from Iris Data Set)
– Because each attribute is split into 3 bins,
there are 9 rectangular 2-dimensional bins
– Height of each rectangular bar indicates
no. of objects(flower in this case) that fall
into each bin
– Most of the flowers fall into only 3 of the
bins – along the diagonal
Visualization Techniques: Box Plots
• Box Plots outlier
– Another way of 90th percentile
displaying the
distribution of the
values of a single 75th percentile
numerical attribute 50th percentile
25th percentile
– Following figure shows
the basic part of a
10th percentile
box plot

Example of Box Plots
• Box plots can be used to compare attributes

Example of Box Plots
• Box Plots can also be used to compare
how attributes vary between different
classes of objects

Visualization Techniques: Pie Chart
• Typically used with
Categorical
Attributes that have
a relatively small no.
of value
• Uses the relative
area of a circle to
indicate relative
freq.

Visualization Techniques: Scatter Plots
• Scatter plots
– Attributes values determine the position
– Two-dimensional scatter plots are most
common, but can have three-dimensional
scatter plots
– Often, additional attributes can be displayed
by using the size, shape, and color of the
markers that represent the objects
– It is useful to have arrays of scatter plots,
because can compactly summarize the
relationships of several pairs of attributes
• See example on the next slide
Scatter Plot Array of Iris Attributes

3-Dimensional Scatter Plot

• Visualization of Spatial – Temporal
Data;
• Contour Plots
• Surface Plots
• Vector Field Plots
• Lower-Dimensional Slices
• Animation

Visualization Techniques: Contour Plots
• Contour plots
– Useful when a continuous attribute is measured
on a spatial grid
– They partition the plane into regions of similar
values
– The contour lines that form the boundaries of
these regions connect points with equal values
– The most common example is contour maps of
elevation
– Can also display temperature, rainfall, air
pressure, etc.
• An example for Sea Surface Temperature (SST) is
provided on the next slide

Contour Plot Example: SST Dec, 1998
Celsius

Visualization Techniques: Surface Plots

Visualization Techniques: Vector Field Plots

Visualization Techniques: Lower-Dimensional Slices

• Visualization of Higher Dimensional
Data;
• Matrix Plots
• Parallel Coordinates
• Star Coordinates & Chernoff Faces

Visualization Techniques: Matrix Plots
• Data Matrix :
– is a rectangular array of values
– Can be visualized as an image by associating
each entry of the data matrix with a pixel
in the image
– The brightness or color of the pixel is
determined by the value of the
corresponding entry of the matrix

• Matrix plots
– Can plot the data matrix
– This can be useful when objects are
sorted according to class
– Typically, the attributes are normalized
to prevent one attribute from dominating
the plot
– Plots of similarity or distance matrices
can also be useful for visualizing the
relationships between objects

• Example: consider the standardized
data matrix for the Iris Data Set
• 3 species;
– 1st 50 rows – Setosa
– 2nd 50 rows – Versicolour
– 3rd 50 rows – Virginica
• Petal width & length;
– Setosa < average
– Versicolour ≈ average
– Virginica > average
Visualization of the Iris Data Matrix
standard
deviation
• Example: consider the correlation

matrix for the Iris Data Set
• The flowers in each group are most
similar to each other
• But Versicolour & Virginica are more
similar to one another than to Setosa

Visualization of the Iris Correlation Matrix
Correlation

Visualization Techniques: Parallel Coordinates
• Parallel Coordinates
– Used to plot the attribute values of high-
dimensional data
– Instead of using perpendicular axes, use a set of
parallel axes
– The attribute values of each object are plotted
as a point on each corresponding coordinate axis
and the points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of
objects group together, at least for some
attributes
– Ordering of attributes is important in seeing such
groupings
Parallel Coordinates Plots for Iris Data
• Example: consider a
parallel coordinates
plot of the 4
numerical attributes
of the Iris Data Set
• The plot shows that
the classes are
reasonably well
separated for the
pedal width & length,
but less well
separated for sepal
width & length

Parallel Coordinates Plots for Iris Data
• Another parallel
coordinates plot ,
with different
ordering of the
axes, of the 4
numerical
attributes of the
Iris Data Set
•

Visualization Techniques : Star Plot
• Star Plots
– Similar approach to parallel coordinates,
but axes radiate from a central point
– The line connecting the values of an
object is a polygon
– The size & shape of
the polygon gives a
visual description of
the attribute values
of the object

Star Plots for Iris Data
Setosa
Versicolour
Virginica

Visualization Techniques : Chernoff Faces
• Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute
with a characteristic of a face
– The values of each attribute determine
the appearance of the corresponding facial
characteristic
– Each object becomes a
separate face
– Relies on human’s ability to
distinguish faces
Visualization Techniques : Chernoff Faces
• Chernoff Faces(cont.)
Data Feature Facial Feature
Sepal Length Size of Face
Sepal Width Forehead/Jaw relative arc
length
Petal Length Shape of Forehead
Petal Width Shape of Jaw
• Other features of the face, e.g. width
between the eyes & length of the
mouth, are given default values
Chernoff Faces for Iris Data
Setosa
Versicolour
Virginica

On-Line Analytical Processing - OLAP
• On-Line Analytical Processing (OLAP)
was proposed by E. F. Codd, the
father of the relational database.
• Relational databases put data into
tables, while OLAP uses a
multidimensional array representation.
– Such representations of data previously
existed in statistics and other fields
• There are a number of data analysis
and data exploration operations that
are easier with such a data
representation.
Creating a Multidimensional Array
• Two key steps in converting tabular
data into a multidimensional array.
– 1st, identify which attributes are to be
the dimensions and which attribute is to
be the target attribute whose values
appear as entries in the multidimensional
array.
• The attributes used as dimensions must have
discrete values
• The target value is typically a count or
continuous value, e.g., the cost of an item
• Can have no target variable at all except the
count of objects that have the same
set of attribute values
Creating a Multidimensional Array
• Two key steps in converting tabular
data into a multidimensional
array(cont.)
– 2nd, find the value of each entry in the
multidimensional array by summing the
values (of the target attribute) or count
of all objects that have the attribute
values corresponding to that entry.

Example: Iris data
• Show how the attributes, petal length,
petal width, and species type can be
converted to a multidimensional array
– First, discretized the petal width and
length to have categorical values: low,
medium, and high
– Then get the following table - note the
count attribute

Example: Iris data

Example: Iris data
• Each unique tuple of petal width, petal
length, and species type identifies one
element of the array.
• This element is
assigned the
corresponding count
value.
• The figure illustrates
the result.
• All non-specified
tuples are 0.
Example: Iris data
• Slices of the multidimensional array
are shown by the following cross-
tabulations
Setosa Versicolour
Virginica

OLAP Operations
• 4 types of OLAP Operations;
– Aggregation
– Pivoting
– Slicing & Dicing
– Roll-Up & Drill-Down

OLAP Operations : Aggregation

OLAP Operations : Pivoting

OLAP Operations : Pivoting

OLAP Operations : Slicing

OLAP Operations : Slicing

OLAP Operations : Dicing

OLAP Operations : Dicing

OLAP Operations : Roll-Up & Drill-Down

OLAP Operations : Roll-Up & Drill-Down

OLAP Operations
Reference :
• Example of OLAP Operations -
pangsida.sakaeo.buu.ac.th/~52410017
/.../Ch06OLAP_Cubes.pptx

Chapter 3 Data Exploration

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chapter 3 Data Exploration

Uploaded by

Copyright:

Available Formats

Data Mining:

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 1

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 2

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 3

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 4

• In EDA(Exploratory Data Analysis), as

• In data mining, clustering and anomaly

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 5

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 7

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 8

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 9

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 11

that p% of the observed values of x

• See example in the sheet

• The mean is the most common measure of

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 14

• Mean can be interpreted as the middle

• To overcome problems with the

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 17

• Example: consider the set of values

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 18

• Commonly used for Continuous Data

• Range can be misleading if most of

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 20

• The Standard Deviation(S.D.) is the

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 21

• So that other measures are often

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 22

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 23

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 24

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 25

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 27

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 28

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 30

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 31

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 32

• Can be used to provide insight into

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 33

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 34

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 35

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 36

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 37

• Box Plots outlier

– Another way of 90th percentile

numerical attribute 50th percentile

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 39

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 40

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 41

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 42

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 44

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 45

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 46

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 47

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 48

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 49

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 50

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 51

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 52

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 53

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 54

• Example: consider the correlation

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 57

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 58

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 60