
Data Mining:

Exploring Data
Created by: Tan, Steinbach, Kumar
Modified by: Thanrat Sintanakul

Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 1


What is Data Exploration?
A preliminary exploration of the data to better
understand its characteristics.
• Key motivations of data exploration include
– Helping to select the right tool for preprocessing
or analysis
– Making use of humans’ abilities to recognize
patterns
• People can recognize patterns not
captured by data analysis tools



What is Data Exploration?
• 3 major topics in exploring data:
– Summary statistics: mean (x̄), standard deviation (S.D.)
– Visualization techniques: histograms,
scatter plots
– On-Line Analytical Processing (OLAP): a
set of techniques for exploring
multidimensional arrays of values



What is Data Exploration?
• OLAP-related analysis functions focus on various
ways to create summary data tables from a
multidimensional data array, by aggregating
data either across various dimensions or
across various attribute values
• For instance, consider sales information
reported according to product, location, and
date
– OLAP techniques can be used to create a
summary that describes the sales activity at a
particular location by month and product category



Techniques Used In Data Exploration

• In EDA (Exploratory Data Analysis), as
originally defined by John Tukey
– The focus was on visualization
– Clustering and anomaly detection were
viewed as exploratory techniques

• In data mining, clustering and anomaly
detection are major areas of interest,
and not thought of as just
exploratory



Iris Sample Data Set
• Many of the exploratory data techniques
are illustrated with the Iris Plant data set.
– 5 attributes: one class attribute and four non-class attributes
• Three flower types (classes):
– Setosa
– Virginica
– Versicolour
• Four (non-class) attributes
– Sepal width and length
– Petal width and length
Image: Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.

Summary Statistics
• Summary statistics are numbers that
summarize properties of the data
– Summarized properties include frequency,
location and spread
• Examples: location - mean
spread - standard deviation
– Most summary statistics can be calculated
in a single pass through the data
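
As a minimal sketch of the single-pass claim, the following computes the count, mean, and sample standard deviation while reading each value only once. It uses Welford's running update, which is one common choice and not necessarily the method the slides have in mind; the function name and the sample values are assumptions made for illustration.

```python
import math

def single_pass_summary(values):
    """Return (count, mean, sample standard deviation) reading each value once."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, math.sqrt(variance)

# Hypothetical values used only to exercise the function.
count, mean, sd = single_pass_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(count, round(mean, 2), round(sd, 2))
```

Statistics such as the median or percentiles need the values in sorted order, so they do not fit this one-pass pattern as directly.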



Summary Statistics
• For example:
– The average household income
– The fraction (percentage) of college
students who complete an undergraduate
degree in 4 years



Frequency and Mode
• The frequency of an attribute value is
the percentage of times the value
occurs in the data set
– For example, given the attribute ‘gender’
and a representative population of people,
the gender ‘female’ occurs about 50% of
the time.
• The mode of an attribute is the most
frequent attribute value



Frequency and Mode
No.  Gender  Department
1    M       Electronics
2    F       Information Technology
3    F       Information Technology
4    F       Business Computer
5    M       Electronics
6    M       Business Computer
7    F       Business Computer
8    F       Information Technology
9    F       Business Computer
10   F       Business Computer
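
The frequency and mode of the two attributes in this table can be worked out with a short sketch. The column values are copied from the 10 rows of the table; the helper names `frequencies` and `mode` are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

# Gender and Department columns copied from the 10-row table.
gender = ["M", "F", "F", "F", "M", "M", "F", "F", "F", "F"]
department = [
    "Electronics", "Information Technology", "Information Technology",
    "Business Computer", "Electronics", "Business Computer",
    "Business Computer", "Information Technology", "Business Computer",
    "Business Computer",
]

def frequencies(values):
    """Map each distinct value to the fraction of rows in which it occurs."""
    counts = Counter(values)
    return {value: n / len(values) for value, n in counts.items()}

def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

gender_freq = frequencies(gender)
dept_freq = frequencies(department)
print(gender_freq, mode(gender))
print(dept_freq, mode(department))
```

For this table, 'F' occurs in 7 of 10 rows and 'Business Computer' in 5 of 10, so these are the modes of the two attributes.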
Frequency and Mode
• The notions of frequency and mode are
typically used with categorical data
• For categorical attributes,
frequency and mode can be interesting and
useful
• For continuous data, the mode is often
not useful because a single value may
not occur more than once



Percentiles
• For continuous data, the notion of a
percentile is more useful.
• Given an ordinal or continuous attribute
x and a number p between 0 and 100,
the pth percentile is a value x_p of x such
that p% of the observed values of x
are less than x_p.
• For instance, the 50th percentile is
the value x_50% such that 50% of all
values of x are less than x_50%.
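
A minimal sketch of one way to compute a percentile from observed values follows. It uses the nearest-rank convention (the smallest observed value with at least p% of the data at or below it), which is an assumption: several conventions exist, and they differ in how they treat ties and interpolation. The function name and the data are hypothetical.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for p between 0 and 100."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # 1-based rank of the chosen value; p = 0 maps to the minimum.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

data = [15, 20, 35, 40, 50]  # hypothetical observations
print(percentile(data, 50), percentile(data, 100))
```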
Percentiles

• See example in the sheet


Measures of Location: Mean and Median

• The mean is the most common measure of
the location of a set of points.
• However, the mean is very sensitive to
outliers.
• Thus, the median or a trimmed mean is also
commonly used.



Measures of Location: Mean and Median
• Mean and median are 2 of the most
widely used measures for continuous data
• Example: consider the heights of 15
students, as follows:
• {130,121,123,124,123,127,131,125,
122,120,127,128,125,124,124}
• Mean = Σxi/n = 1874/15 = 124.93
• Median: sort the values and take the middle (8th) one
– 120,121,122,123,123,124,124,124,125,
125,127,127,128,130,131
– Median = 124
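
The arithmetic for the 15 heights can be checked with Python's standard `statistics` module. This is only a sketch of the calculation; the variable names are assumptions.

```python
from statistics import mean, median

# Heights of the 15 students from the example.
heights = [130, 121, 123, 124, 123, 127, 131, 125,
           122, 120, 127, 128, 125, 124, 124]

height_mean = mean(heights)      # sum of the values divided by their count
height_median = median(heights)  # middle value of the sorted list

print(sum(heights), len(heights), round(height_mean, 2), height_median)
```

With an odd number of values the median is the single middle value; with an even number, `median` averages the two middle values.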
Measures of Location: Mean and Median

• The mean can be interpreted as the middle
of a set of values when:
– The values are distributed in a symmetric
manner (normal curve)
– There are no outliers in the data set
• Conversely, the median is the better
choice when:
– The distribution of values is skewed
– The data contain outliers
Measures of Location: Mean and Median

• To overcome problems with the
traditional definition of a mean, the
notion of a trimmed mean is
sometimes used
• A percentage p between 0 and 100 is
specified, the top and bottom (p/2)% of
the data set are thrown out, and the mean
of the remaining values is computed as usual



Measures of Location: Mean and Median

• Example: consider the set of values
{1,2,3,4,5,90}
• Traditional Mean = 17.5 (does not represent the true central value)
• Traditional Median = 3.5
• Trimmed Mean with p=40% is
– Remove the smallest 20% (= 1.2 values, taken as 1 value) and the
largest 20% (= 1.2 values, taken as 1 value), leaving {2,3,4,5}
– Then compute the mean as usual, which gives 3.5
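
The trimmed-mean example can be reproduced with a short sketch. The function name is hypothetical, and truncating the number of dropped values to a whole count (so that 1.2 values becomes 1) is an assumption chosen to match the example.

```python
def trimmed_mean(values, p):
    """Drop the lowest and highest (p/2)% of the values, then average the rest."""
    if not 0 <= p < 100:
        raise ValueError("p must satisfy 0 <= p < 100")
    ordered = sorted(values)
    # Number of values to drop at each end, truncated to a whole count
    # (assumption: 1.2 values is taken as 1 value).
    k = int(len(ordered) * p / 200)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 90]
print(sum(data) / len(data))   # plain mean, pulled upward by the value 90
print(trimmed_mean(data, 40))  # mean of [2, 3, 4, 5]
```

With p = 0 the function reduces to the ordinary mean, and as p approaches 100 it approaches the median.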



Measures of Spread: Range and Variance

• Commonly used for continuous data
• Measure the dispersion or spread of a
set of values, by indicating if the
attribute values are widely spread out
or if they are relatively concentrated
around a single point, e.g. the mean
• Range is the difference between the
maximum and minimum values:
range(x) = max(x) − min(x) = x_(m) − x_(1),
where x_(1) and x_(m) are the smallest and largest of the m values
Measures of Spread: Range and Variance

• Range can be misleading if most of
the values are concentrated in a
narrow band of values, but there are
also a relatively small number of more
extreme values (outliers)
• The Variance or Standard Deviation is
the most common measure of the
spread of a set of points.



Measures of Spread: Range and Variance

• The Standard Deviation (S.D.) is the
square root of the Variance (written
as s_x)
• The Mean can be distorted by
Outliers, and since the Variance is
computed using the Mean, it’s also
sensitive to Outliers



Measures of Spread: Range and Variance

• For this reason, other measures are often
used:
– AAD: Absolute Average Deviation
– MAD: Median Absolute Deviation
– IQR: Interquartile Range
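
The sketch below computes these spread measures side by side using Python's `statistics` module. Several details are assumptions rather than definitions taken from the slides: the variance is the sample variance (dividing by n − 1), AAD is the mean of the absolute deviations from the mean, MAD here takes absolute deviations from the median (some texts use the mean instead), and the quartiles use the module's default method.

```python
from statistics import mean, median, quantiles, stdev, variance

def spread_summary(values):
    """Return several measures of spread for a list of numbers."""
    center_mean = mean(values)
    center_median = median(values)
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    return {
        "range": max(values) - min(values),
        "variance": variance(values),  # sample variance (divides by n - 1)
        "sd": stdev(values),           # square root of the sample variance
        "aad": mean([abs(x - center_mean) for x in values]),
        "mad": median([abs(x - center_median) for x in values]),
        "iqr": q3 - q1,                # 75th minus 25th percentile
    }

summary = spread_summary([1, 2, 3, 4, 5, 90])
for name, value in summary.items():
    print(name, round(value, 3))
```

On a set such as {1,2,3,4,5,90}, the range, variance, and standard deviation are dominated by the single value 90, while MAD stays small.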



Visualization
• Visualization is the conversion of data
into a visual or tabular format so that
the characteristics of the data and the
relationships among data items or
attributes can be analyzed or reported.
• Sometimes the use of visualization
techniques in data mining is referred to
as Visual Data Mining



Visualization
• Visualization of data is one of the
most powerful and appealing techniques
for data exploration.
– Humans have a well developed ability to
analyze large amounts of information that
is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns



Example: Sea Surface Temperature
• The following shows the Sea Surface
Temperature (SST) for July 1982
– Tens of thousands of data points are
summarized in a single figure



Representation
• Is the mapping of information to a visual
format
• Data objects, their attributes, and the
relationships among data objects are
translated into graphical elements such as
points, lines, shapes, and colors.
• Example:
– Objects are often represented as points
– Their attribute values can be represented as the
position of the points or the characteristics of
the points, e.g., color, size, and shape
– If position is used, then the relationships of
points, i.e., whether they form groups or a point
is an outlier, are easily perceived.
Arrangement
• Is the placement of visual elements
within a display
• Can make a large difference in how
easy it is to understand the data
• Example:



Arrangement



Selection
• Is the elimination or the de-emphasis
of certain objects and attributes
• Selection may involve choosing a
subset of attributes
– Dimensionality reduction is often used to
reduce the number of dimensions to two or
three
– Alternatively, pairs of attributes can be
considered
Selection
• Selection may also involve choosing a
subset of objects
– A region of the screen can only show so
many points
– Can sample, but want to preserve points
in sparse areas



Visualization Techniques
• 3 categories of visualization
techniques;
– Visualization of data with small no. of
attributes
– Visualization of data with spatial and/or
temporal attributes
– Visualization of data with many
attributes



Visualization Techniques
• Visualization of data with small no.
of attributes;
• Stem & Leaf Plots
• Histograms
• Box Plots
• Pie Chart
• Scatter Plots



Visualization Techniques: Stem & Leaf Plots

• Can be used to provide insight into
the distribution of one-dimensional
integer or continuous data
• Example: consider the set of integers
of the sepal length (from the Iris
Data Set) in cm.
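
As a rough sketch of how a stem and leaf plot is built, the following groups two-digit integers by their tens digit (the stem) and lists the ones digits (the leaves). The values are hypothetical and are not the actual Iris measurements; the function name is also an assumption.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit integers by tens digit and collect the ones digits."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(groups.items())}

# Hypothetical integer measurements, not the actual Iris values.
sample = [43, 44, 44, 45, 46, 50, 50, 51, 57, 58, 61, 63, 64, 70]
plot = stem_and_leaf(sample)
for stem, leaves in plot.items():
    print(f"{stem} : {leaves}")
```

The length of each row of leaves shows how many values share that stem, so the printed rows already resemble a sideways histogram.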



Visualization Techniques: Stem & Leaf Plots



Visualization Techniques: Histograms
• Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the
number of objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
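
The effect of the number of bins can be seen in a small sketch that counts values in equal-width bins. The data are hypothetical rather than the actual petal widths, and equal-width binning between the minimum and maximum is an assumption about how the bins are formed.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements, not the actual Iris petal widths.
sample = [0.2, 0.2, 0.3, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.0, 2.3, 2.5]
two_bins = histogram(sample, 2)
four_bins = histogram(sample, 4)
print(two_bins)
print(four_bins)
```

With 2 bins the sample looks evenly split; with 4 bins a cluster of small values becomes visible, which is why the choice of bin count matters.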



Two-Dimensional Histograms
• Show the joint distribution of the
values of two attributes
• Each attribute is divided into intervals
and the 2 sets of intervals define
2-dimensional rectangles of values
• Can be used to discover interesting
facts about how the values of 2
attributes co-occur



Two-Dimensional Histograms
• Example: petal width and petal length
– What does this tell us?



Two-Dimensional Histograms
• Example (cont.):
– Shows a 2-dimensional histogram of petal
length and petal width (from the Iris Data Set)
– Because each attribute is split into 3 bins,
there are 9 rectangular 2-dimensional bins
– The height of each rectangular bar indicates the
number of objects (flowers in this case) that fall
into each bin
– Most of the flowers fall into only 3 of the
bins, those along the diagonal
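
A two-dimensional histogram of this kind can be sketched by binning each attribute separately and counting the pairs. The points below are hypothetical, chosen so that they fall along the diagonal; they are not the Iris measurements, and the helper names are assumptions.

```python
def bin_index(value, lo, hi, bins):
    """Index of the equal-width bin holding value; the maximum goes in the last bin."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)

def histogram_2d(xs, ys, bins=3):
    """Return a bins x bins grid; grid[i][j] counts points in x-bin i and y-bin j."""
    grid = [[0] * bins for _ in range(bins)]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    for x, y in zip(xs, ys):
        i = bin_index(x, x_lo, x_hi, bins)
        j = bin_index(y, y_lo, y_hi, bins)
        grid[i][j] += 1
    return grid

# Hypothetical (length, width) pairs, not the actual Iris values.
lengths = [1.0, 1.5, 1.4, 4.0, 4.5, 4.2, 6.0, 6.5, 7.0]
widths = [0.2, 0.3, 0.2, 1.3, 1.5, 1.2, 2.0, 2.3, 2.5]
grid = histogram_2d(lengths, widths, bins=3)
for row in grid:
    print(row)
```

When the two attributes rise together, as in this sample, the non-zero counts sit on the diagonal of the grid.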
Visualization Techniques: Box Plots

• Box Plots
– Another way of displaying the
distribution of the values of a single
numerical attribute
– The following figure shows
the basic parts of a
box plot
[Figure: box plot labeled with outlier, 90th, 75th, 50th, 25th, and 10th percentiles]



Example of Box Plots
• Box plots can be used to compare attributes



Example of Box Plots
• Box Plots can also be used to compare
how attributes vary between different
classes of objects



Visualization Techniques: Pie Chart
• Typically used with categorical attributes that have
a relatively small number of values
• Uses the relative area of a circle to
indicate relative frequency



Visualization Techniques: Scatter Plots
• Scatter plots
– Attribute values determine the position
– Two-dimensional scatter plots are most
common, but can have three-dimensional
scatter plots
– Often, additional attributes can be displayed
by using the size, shape, and color of the
markers that represent the objects
– It is useful to have arrays of scatter plots,
because they can compactly summarize the
relationships of several pairs of attributes
• See example on the next slide
Scatter Plot Array of Iris Attributes



3-Dimensional Scatter Plot



Visualization Techniques
• Visualization of Spatial – Temporal
Data;
• Contour Plots
• Surface Plots
• Vector Field Plots
• Lower-Dimensional Slices
• Animation



Visualization Techniques: Contour Plots
• Contour plots
– Useful when a continuous attribute is measured
on a spatial grid
– They partition the plane into regions of similar
values
– The contour lines that form the boundaries of
these regions connect points with equal values
– The most common example is contour maps of
elevation
– Can also display temperature, rainfall, air
pressure, etc.
• An example for Sea Surface Temperature (SST) is
provided on the next slide



Contour Plot Example: SST Dec, 1998

[Figure: contour plot of sea surface temperature; scale in degrees Celsius]



Visualization Techniques: Surface Plots



Visualization Techniques: Vector Field Plots



Visualization Techniques: Lower-Dimensional Slices



Visualization Techniques
• Visualization of Higher Dimensional
Data;
• Matrix Plots
• Parallel Coordinates
• Star Coordinates & Chernoff Faces



Visualization Techniques: Matrix Plots

• Data Matrix :
– is a rectangular array of values
– Can be visualized as an image by associating
each entry of the data matrix with a pixel
in the image
– The brightness or color of the pixel is
determined by the value of the
corresponding entry of the matrix



Visualization Techniques: Matrix Plots

• Matrix plots
– Can plot the data matrix
– This can be useful when objects are
sorted according to class
– Typically, the attributes are normalized
to prevent one attribute from dominating
the plot
– Plots of similarity or distance matrices
can also be useful for visualizing the
relationships between objects
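
One common way to normalize the attributes before plotting is to standardize each column to mean 0 and standard deviation 1. The sketch below assumes that form of normalization and uses the population standard deviation; the data matrix and function name are hypothetical.

```python
from statistics import mean, pstdev

def standardize_columns(matrix):
    """Rescale each column (attribute) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        # A constant column has zero spread, so it is mapped to 0.0.
        [(value - m) / s if s > 0 else 0.0
         for value, m, s in zip(row, means, sds)]
        for row in matrix
    ]

# Hypothetical 3 x 2 data matrix: rows are objects, columns are attributes.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
z = standardize_columns(data)
for row in z:
    print([round(value, 3) for value in row])
```

Although the second attribute is 100 times larger than the first, both columns have identical standardized values, so neither dominates the pixel brightness.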



Visualization Techniques: Matrix Plots
• Example: consider the standardized
data matrix for the Iris Data Set
• 3 species;
– 1st 50 rows – Setosa
– 2nd 50 rows – Versicolour
– 3rd 50 rows – Virginica
• Petal width & length;
– Setosa < average
– Versicolour ≈ average
– Virginica > average
Visualization of the Iris Data Matrix

[Figure: image of the standardized Iris data matrix; scale in units of standard deviation]
Visualization Techniques: Matrix Plots

• Example: consider the correlation
matrix for the Iris Data Set
• The flowers in each group are most
similar to each other
• But Versicolour and Virginica are more
similar to one another than to Setosa
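
A correlation matrix between objects can be sketched by computing the Pearson correlation for every pair of rows. The three objects below are hypothetical four-attribute measurements written in the style of Iris rows, and the function names are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

def correlation_matrix(objects):
    """Square matrix whose entry [i][j] is the correlation of objects i and j."""
    return [[pearson(a, b) for b in objects] for a in objects]

# Hypothetical objects: two similar flowers and one with a different shape.
objects = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
]
corr = correlation_matrix(objects)
for row in corr:
    print([round(value, 3) for value in row])
```

Sorting the rows by class before plotting such a matrix is what makes blocks of highly correlated objects visible along the diagonal.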



Visualization of the Iris Correlation Matrix

[Figure: image of the Iris correlation matrix; scale shows correlation]



Visualization Techniques: Parallel Coordinates

• Parallel Coordinates
– Used to plot the attribute values of high-
dimensional data
– Instead of using perpendicular axes, use a set of
parallel axes
– The attribute values of each object are plotted
as a point on each corresponding coordinate axis
and the points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of
objects group together, at least for some
attributes
– Ordering of attributes is important in seeing such
groupings
Parallel Coordinates Plots for Iris Data
• Example: consider a parallel coordinates
plot of the 4 numerical attributes
of the Iris Data Set
• The plot shows that the classes are
reasonably well separated for
petal width and length, but less well
separated for sepal width and length



Parallel Coordinates Plots for Iris Data

• Another parallel coordinates plot,
with a different ordering of the
axes, of the 4 numerical
attributes of the Iris Data Set



Visualization Techniques : Star Plot

• Star Plots
– Similar approach to parallel coordinates,
but axes radiate from a central point
– The line connecting the values of an
object is a polygon
– The size & shape of
the polygon gives a
visual description of
the attribute values
of the object



Star Plots for Iris Data

Setosa

Versicolour

Virginica



Visualization Techniques : Chernoff Faces

• Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute
with a characteristic of a face
– The values of each attribute determine
the appearance of the corresponding facial
characteristic
– Each object becomes a
separate face
– Relies on humans’ ability to
distinguish faces
Visualization Techniques : Chernoff Faces

• Chernoff Faces (cont.)
Data Feature    Facial Feature
Sepal Length    Size of Face
Sepal Width     Forehead/Jaw relative arc length
Petal Length    Shape of Forehead
Petal Width     Shape of Jaw
• Other features of the face, e.g. width
between the eyes & length of the
mouth, are given default values
Chernoff Faces for Iris Data
Setosa

Versicolour

Virginica



On-Line Analytical Processing - OLAP
• On-Line Analytical Processing (OLAP)
was proposed by E. F. Codd, the
father of the relational database.
• Relational databases put data into
tables, while OLAP uses a
multidimensional array representation.
– Such representations of data previously
existed in statistics and other fields
• There are a number of data analysis
and data exploration operations that
are easier with such a data
representation.
Creating a Multidimensional Array
• Two key steps in converting tabular
data into a multidimensional array.
– 1st, identify which attributes are to be
the dimensions and which attribute is to
be the target attribute whose values
appear as entries in the multidimensional
array.
• The attributes used as dimensions must have
discrete values
• The target value is typically a count or
continuous value, e.g., the cost of an item
• Can have no target variable at all except the
count of objects that have the same
set of attribute values
Creating a Multidimensional Array
• Two key steps in converting tabular
data into a multidimensional
array(cont.)
– 2nd, find the value of each entry in the
multidimensional array by summing the
values (of the target attribute) or count
of all objects that have the attribute
values corresponding to that entry.
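
The two steps can be sketched with a dictionary keyed by tuples of dimension values, which acts as a sparse multidimensional array. The records, attribute names, and the `build_cube` helper are all hypothetical; an OLAP system would store the array differently.

```python
from collections import defaultdict

def build_cube(records, dimensions, target=None):
    """Aggregate records into a sparse array keyed by dimension values.

    With target=None each entry counts the matching records; otherwise it
    sums the target attribute over those records.
    """
    cube = defaultdict(float if target else int)
    for record in records:
        key = tuple(record[d] for d in dimensions)  # step 1: pick the cell
        cube[key] += record[target] if target else 1  # step 2: sum or count
    return dict(cube)

# Hypothetical sales records; attribute names are illustrative only.
records = [
    {"product": "pen", "location": "north", "month": "Jan", "revenue": 10.0},
    {"product": "pen", "location": "north", "month": "Jan", "revenue": 5.0},
    {"product": "ink", "location": "south", "month": "Feb", "revenue": 7.5},
]
counts = build_cube(records, ["product", "location"])
revenue = build_cube(records, ["product", "location", "month"], target="revenue")
print(counts)
print(revenue)
```

Cells that never appear in the data are simply absent from the dictionary and can be read as 0 with `counts.get(key, 0)`.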



Example: Iris data
• Show how the attributes, petal length,
petal width, and species type can be
converted to a multidimensional array
– First, discretize the petal width and
length to have categorical values: low,
medium, and high
– Then get the following table - note the
count attribute



Example: Iris data



Example: Iris data
• Each unique tuple of petal width, petal
length, and species type identifies one
element of the array.
• This element is
assigned the
corresponding count
value.
• The figure illustrates
the result.
• All non-specified
tuples are 0.
Example: Iris data
• Slices of the multidimensional array
are shown by the following cross-
tabulations

Setosa Versicolour

Virginica



OLAP Operations
• 4 types of OLAP Operations;
– Aggregation
– Pivoting
– Slicing & Dicing
– Roll-Up & Drill-Down
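
As a rough illustration of three of these operations, the sketch below works on a small cube stored as a dictionary from (product, location, month) to a count. The cube contents and function names are hypothetical, and the definitions used (aggregation sums out a dimension, slicing fixes one dimension to a single value, dicing keeps a subset of values on chosen dimensions) follow common usage rather than anything stated on these slides.

```python
from collections import defaultdict

def aggregate(cube, axis):
    """Sum out one dimension, returning a cube with one fewer dimension."""
    result = defaultdict(int)
    for key, value in cube.items():
        result[key[:axis] + key[axis + 1:]] += value
    return dict(result)

def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop that dimension."""
    return {key[:axis] + key[axis + 1:]: v
            for key, v in cube.items() if key[axis] == value}

def dice(cube, allowed):
    """Keep cells whose value on each listed axis is in the allowed set."""
    return {key: v for key, v in cube.items()
            if all(key[axis] in values for axis, values in allowed.items())}

# Hypothetical cube: (product, location, month) -> units sold.
cube = {
    ("pen", "north", "Jan"): 3,
    ("pen", "south", "Jan"): 2,
    ("ink", "north", "Feb"): 4,
    ("pen", "north", "Feb"): 1,
}
by_product_location = aggregate(cube, axis=2)
north_only = slice_cube(cube, axis=1, value="north")
pens_in_jan = dice(cube, {0: {"pen"}, 2: {"Jan"}})
print(by_product_location)
print(north_only)
print(pens_in_jan)
```

Roll-up can be viewed as aggregation along a hierarchy (for example, months into quarters), and drill-down as the reverse; pivoting changes which dimensions are displayed rather than the stored values.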



OLAP Operations : Aggregation



OLAP Operations : Pivoting



OLAP Operations : Pivoting



OLAP Operations : Slicing



OLAP Operations : Slicing



OLAP Operations : Dicing



OLAP Operations : Dicing



OLAP Operations : Roll-Up & Drill-Down



OLAP Operations : Roll-Up & Drill-Down



OLAP Operations

Reference:
• Example of OLAP Operations -
pangsida.sakaeo.buu.ac.th/~52410017/.../Ch06OLAP_Cubes.pptx

