
ITS 665 Data Mining

Topic 2
Understanding your Data

Shuzlina Abdul Rahman


(shuzlina@fskm.uitm.edu.my)

Centre of Information Systems Studies


Faculty of Computer and Mathematical Sciences, UiTM
Source: Adapted from Jiawei Han and Micheline Kamber (2012); Tan et al. (2012)
Objectives

To differentiate Data Objects and Attribute Types

To understand Basic Statistical Descriptions of Data

To explain several types of Data Visualization


Types of Data Sets

Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector
Transaction data

Graph and network
World Wide Web
Social or information networks
Molecular Structures

Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data

Spatial, image and multimedia:

Record Data

Data that consists of a collection of records, each


of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as points
in a multi-dimensional space, where each dimension
represents a distinct attribute

Such a data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
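As a minimal sketch of the idea above, the two rows of the table can be stored as a 2 by 5 matrix and treated as points in five-dimensional space. The variable names and the choice of Euclidean distance are illustrative assumptions, not part of the original slides.

```python
import math

# Each row is one data object; each column is one numeric attribute.
columns = ["projection_x_load", "projection_y_load", "distance", "load", "thickness"]
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)      # number of objects (rows)
n = len(data_matrix[0])   # number of attributes (columns)

# Because the objects are points in n-dimensional space, a distance between
# them is well defined. Euclidean distance is used here as an assumption.
distance = math.dist(data_matrix[0], data_matrix[1])
print(m, n, round(distance, 3))
```

In practice the attributes would usually be rescaled first, since the columns are measured in different units.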
Document Data

Each document becomes a 'term' vector:
each term is a component (attribute) of the vector, and
the value of each component is the number of times the
corresponding term occurs in the document.
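The construction of a term vector can be sketched as follows. The two toy documents, the whitespace tokenisation, and the function name `term_vector` are assumptions made for illustration only.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Return the term-frequency vector of `document` over `vocabulary`."""
    counts = Counter(document.lower().split())
    # One component per vocabulary term; absent terms count as 0.
    return [counts[term] for term in vocabulary]

# Hypothetical toy documents, not taken from the slides.
docs = [
    "data mining finds patterns in data",
    "graph mining and graph partitioning",
]

# The vocabulary is the sorted set of all terms seen in the collection.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Every document is mapped to a vector of the same length, so the collection as a whole forms a data matrix with one row per document.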
Transaction Data
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of
products purchased by a customer during one shopping
trip constitutes a transaction, while the individual
products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
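One possible in-memory representation of the table above stores each transaction as a set of items keyed by its TID. Counting how many transactions contain each item is shown as an example of what this representation makes easy; the variable names are assumptions.

```python
from collections import Counter

# The five transactions from the table, each stored as a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

for item, count in sorted(item_counts.items()):
    print(item, count)
```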
Graph Data
Examples: Generic graph and HTML Links

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Other Types of Data

Ordered Data
Sequences of transactions: each element of the sequence is a set of items/events
Genomic sequence data:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Objects

Data sets are made up of data objects.


A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples, examples, instances, data
points, objects, tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.

Attributes

Attribute (or dimensions, features, variables): a data field, representing a
characteristic or feature of a data object.
E.g., customer_ID, name, address
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled

Attribute Types

Nominal: categories, states, or names of


things
Hair_color = {black, brown, blond, red, auburn, grey,
white}
We can assign a code of 0 for black, 1 for brown
marital status, occupation, ID numbers, zip codes

Nominal attribute values do not have any meaningful order


about them and are not quantitative
It makes no sense to find the mean (average) value or
median (middle) value for such an attribute, given a set of
objects.
The exception is the attribute's most commonly occurring value (the mode), which can still be found.

Attribute Types

Binary (Boolean true or false)


Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally
important
e.g., gender

Asymmetric binary: outcomes not equally


important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important
outcome (e.g., HIV positive)

Attribute Types

Ordinal
Values have a meaningful order (ranking) but
magnitude between successive values is not
known.
Size = {small, medium, large}, grades, army
rankings
Grade (e.g., A+, A, A-, B+, B, B-, C+, C, C-,
D+, D, E, F)

Note that nominal, binary, and ordinal attributes


are qualitative.
Describe a feature of an object, without giving an
actual size or quantity.
Numeric Attribute Types

Numeric attribute: a measurable quantity


(represented in integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar
dates
No true zero-point: neither 0 °C nor 0 °F
indicates no temperature.
We can compute their mean value, in addition to

the median and mode measures of central


tendency.

Numeric Attribute Types

Ratio
Inherent zero-point
We can speak of a value as being a multiple
(or ratio) of another value
(10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length,
counts,
monetary quantities (e.g., you are 100 times
richer with $100 than with $1).

Discrete vs. Continuous Attributes

Discrete Attribute
Has only a finite or countably infinite set of
values
E.g., zip codes, profession, or the set of

words in a collection of documents


Sometimes, represented as integer variables

Note: Binary attributes are a special case of

discrete attributes
Note that discrete attributes may have numeric
values, such as 0 and 1 for binary attributes, or,
the values 0 to 110 for the attribute Age.

Discrete vs. Continuous Attributes

Continuous Attribute
Has real numbers as attribute values

E.g., temperature, height, or weight

Practically, real values can only be measured

and represented using a finite number of


digits
Continuous values are real numbers, whereas numeric values
can be either integers or real numbers.


Continuous attributes are typically

represented as floating-point variables

Properties of Attribute Values

The type of an attribute depends on which of


the following properties it possesses:
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /

Nominal attribute: distinctness


Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Attribute Type: Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, grades, street numbers
Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute Type: Interval
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Operations: mean, standard deviation, Pearson's correlation, t and F tests

Attribute Type: Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Operations: geometric mean, harmonic mean, percent variation
Examples

Source:
http://www.perceptualedge.com/articles/dmreview/quant_vs_cat_data.pdf
BASIC STATISTICAL DESCRIPTIONS
OF DATA
Measuring the Central Tendency
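Mean (algebraic measure) (sample vs. population):
Note: n is sample size and N is population size
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median: A holistic measure
Middle value if odd number of values, or average of the middle two
values otherwise
Estimated by interpolation (for grouped data):
$median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c$
where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ is the sum of the frequencies of all intervals below the median interval, $f_{median}$ is the frequency of the median interval, and $c$ is the interval width
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
For unimodal frequency curves that are moderately skewed, the empirical formula:
$mean - mode \approx 3 \times (mean - median)$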
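The measures above can be computed with Python's standard `statistics` module, as in the following sketch. The data values and the helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`) are hypothetical, and trimming a fixed count `k` from each end is only one possible trimming convention.

```python
import statistics

# Hypothetical sample, e.g. salaries in thousands.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)
median = statistics.median(values)      # even count: average of the middle two
modes = statistics.multimode(values)    # all most-frequent values (bimodal here)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(xs)
    return statistics.mean(ordered[k:len(ordered) - k])

def grouped_median(l1, n, freq_below, freq_median, width):
    """Median by interpolation: L1 + ((n/2 - (sum f)_l) / f_median) * c."""
    return l1 + (n / 2 - freq_below) / freq_median * width

print(mean, median, modes, trimmed_mean(values, 1))
```

The single large value 110 pulls the mean above the median, which is why the trimmed mean and the median are often preferred for skewed data.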


Symmetric vs. Skewed Data

Median, mean and mode of


symmetric, positively and
negatively skewed data
symmetric

positively skewed
negatively skewed

Measuring the Dispersion of Data

Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Five number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot
outliers individually
Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

Variance and standard deviation (sample: s, population: σ)
Variance: (algebraic, scalable computation)
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

Standard deviation s (or σ) is the square root of variance s² (or σ²)
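The two equivalent forms of the sample variance, and the population variance, can be checked numerically with a short sketch. The function names and the data are assumptions; the one-pass form needs only running sums, which is what makes the computation scalable, although it can lose precision in floating point when the values are large relative to their spread.

```python
import math

def sample_variance(xs):
    """Two-pass form: sum of squared deviations from the mean over (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    """Algebraically equivalent form built from sum(x) and sum(x^2) only."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)

def population_variance(xs):
    """sigma^2 = (1/N) * sum(x^2) - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
s = math.sqrt(sample_variance(data))
sigma = math.sqrt(population_variance(data))
print(sample_variance(data), sample_variance_one_pass(data), s, sigma)
```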

Properties of Normal Distribution Curve

The normal (distribution) curve
From μ-σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard deviation)
From μ-2σ to μ+2σ: contains about 95% of it
From μ-3σ to μ+3σ: contains about 99.7% of it
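These percentages can be reproduced from the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the standard library; the helper name `coverage` is an assumption.

```python
from statistics import NormalDist

def coverage(k, mu=0.0, sigma=1.0):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(coverage(k) * 100, 1))
```

The result does not depend on the particular values of the mean and standard deviation, only on the number of standard deviations k.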

Boxplot Analysis

Five-number summary of a distribution:


Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and
third quartiles, i.e., the height of the box
is IQR
The median is marked by a line within
the box
Whiskers: two lines outside the box
extend to Minimum and Maximum
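The numbers a boxplot is drawn from can be computed as in the sketch below. The data are hypothetical, and the quartiles use the "inclusive" interpolation method of `statistics.quantiles`; other quartile conventions give slightly different values. The 1.5 x IQR rule is the usual outlier criterion.

```python
import statistics

def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, median, q3, max(xs)

def iqr_outliers(xs):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low or x > high]

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # hypothetical
print(five_number_summary(values))
print(iqr_outliers(values))
```

When outliers are plotted individually, a common variant stops each whisker at the most extreme value that is not an outlier instead of at the minimum and maximum.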

Visualization of Data Dispersion:
Boxplot Analysis

Histogram Analysis

Histogram: Graph display of


tabulated frequencies, shown as
bars
It shows what proportion of cases
fall into each of several categories
Differs from a bar chart in that it is
the area of the bar that denotes
the value, not the height as in bar
charts, a crucial distinction when
the categories are not of uniform
width
The categories are usually
specified as non-overlapping
intervals of some variable. The
categories (bars) must be adjacent
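The point about area versus height can be illustrated with bins of unequal width. In this sketch (function names, data, and bin edges are all assumptions), each bar's height is the count divided by the bin width, so that the bar's area equals the count.

```python
def histogram(values, edges):
    """Count values in adjacent bins [edges[i], edges[i+1]); the last bin also includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

def bar_heights(counts, edges):
    """Height = count / width, so that area (height * width) equals the count."""
    return [c / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]

values = [1, 2, 2, 3, 5, 8, 9, 12, 15, 19]   # hypothetical data
edges = [0, 5, 10, 20]                        # non-uniform, non-overlapping bins
counts = histogram(values, edges)
heights = bar_heights(counts, edges)
print(counts, heights)
```

The last two bins hold the same number of values, but the wider bin gets a bar half as tall, which is exactly the distinction from a bar chart.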

Histograms Often Tell More than Boxplots

The two histograms shown
on the left may have the
same boxplot
representation
The same values for:
min, Q1, median, Q3,
max
But they have rather
different data distributions

Quantile Plot
Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates
that approximately 100·f_i % of the data are below or
equal to the value x_i
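The points of a quantile plot can be generated as in this sketch. It assumes the common convention f_i = (i - 0.5) / n for the i-th smallest of n values; the slides do not state which formula is used, and the function name is hypothetical.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs, with x_i sorted and f_i = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_plot_points([40, 10, 30, 20])   # hypothetical unit prices
for f, x in points:
    print(f, x)
```

Plotting x_i against f_i shows every observation, so both the overall shape and individual unusual values remain visible.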

Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
View: Is there a shift in going from one distribution to
another?
Example shows unit price of items sold at Branch 1 vs.
Branch 2 for each quantile. Unit prices of items sold at
Branch 1 tend to be lower than those at Branch 2.
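A Q-Q plot pairs matching quantiles of the two samples. The sketch below computes quartile pairs for two hypothetical lists of unit prices (the numbers are invented for illustration and are not the data behind the slide's figure), using the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def qq_points(sample_a, sample_b, n_quantiles=4):
    """Pair the corresponding quantiles of two univariate samples."""
    qa = statistics.quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical unit prices at two branches.
branch1 = [40, 50, 60, 70, 80]
branch2 = [45, 56, 68, 79, 90]

pairs = qq_points(branch1, branch2)
print(pairs)
```

If every pair lies above the line y = x, as here, the second distribution is shifted toward higher values relative to the first.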

Scatter plot
Provides a first look at bivariate data to see
clusters of points, outliers, etc
Each pair of values is treated as a pair of
coordinates and plotted as points in the plane

Positively and Negatively Correlated Data


If the pattern of plotted points slopes from lower left to upper right, this means that the values of X increase as the values of Y increase, which suggests a positive correlation.
If the pattern of plotted points slopes from upper left to lower right, then the values of X increase as the values of Y decrease, suggesting a negative correlation.

The left half fragment is positively correlated
The right half is negatively correlated
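The direction seen in a scatter plot can be summarised numerically. The sketch below uses Pearson's correlation coefficient as one possible measure (an assumption here; the slides describe correlation only visually), with hypothetical data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values; lies between -1 and 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
ys_rising = [2, 4, 5, 4, 5]    # pattern slopes from lower left to upper right
ys_falling = [5, 4, 5, 4, 2]   # pattern slopes from upper left to lower right

r_rising = pearson_r(xs, ys_rising)
r_falling = pearson_r(xs, ys_falling)
print(round(r_rising, 3), round(r_falling, 3))
```

A value near 0 would correspond to the uncorrelated case, where the points show no upward or downward trend.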

Uncorrelated Data

Exercise 1

What is the median


value?
What are the lower and
upper values?
What are the outlier
values?
Interpret the box-and-
whisker plot.
SEVERAL TYPES OF DATA
VISUALIZATION
Data Visualization

Why data visualization?


Gain insight into an information space by mapping data onto
graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships
among data
Help find interesting regions and suitable parameters for further
quantitative analysis
Provide a visual proof of computer representations derived

Typical visualization methods:


Geometric techniques
Icon-based techniques
Hierarchical techniques

Geometric Techniques

Visualization of geometric transformations and


projections of the data
Methods
Direct data visualization
Scatterplot matrices
Landscapes
Projection pursuit technique
Finding meaningful projections of
multidimensional data
Prosection views
Hyperslice
Parallel coordinates

Scatterplot Matrices

Used by permission of M. Ward, Worcester Polytechnic Institute

Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of
(k² - k)/2 distinct scatterplots, one for each pair of dimensions]
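The number of panels can be checked by enumerating the attribute pairs. This sketch counts each unordered pair of distinct dimensions once, which gives (k² - k)/2 pairs; the attribute names are hypothetical.

```python
from itertools import combinations

def scatterplot_pairs(attributes):
    """All unordered pairs of distinct attributes, one scatterplot per pair."""
    return list(combinations(attributes, 2))

attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
pairs = scatterplot_pairs(attributes)
k = len(attributes)
print(len(pairs), (k * k - k) // 2)
```

A full k by k matrix shows each pair twice, mirrored across the diagonal, so the count grows quadratically and the display becomes hard to read as k increases.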
Landscapes

[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]

Visualization of the data as perspective landscape
The data needs to be transformed into a (possibly artificial)
2D spatial representation which preserves the characteristics
of the data
Icon-based Techniques

Visualization of the data values as features of


icons
Typical visualization methods:
Chernoff Faces

Stick Figures

General techniques
Shape Coding: Use shape to represent certain

information encoding
Color Icons: Using color icons to encode more

information
TileBars: The use of small icons representing the

relevance feature vectors in document retrieval


Chernoff Faces
A way to display variables on a two-dimensional surface,
e.g., let x be eyebrow slant, y be eye size, z be nose length,
etc.
The figure shows faces produced using 10 characteristics
(head eccentricity, eye size, eye spacing, eye eccentricity,
pupil size, eyebrow slant, nose size, mouth shape, mouth
size, and mouth opening), each assigned one of 10 possible
values, generated using Mathematica (S. Dickson)

REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993.
Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html
Hierarchical Techniques

Visualization of the data using a


hierarchical partitioning into subspaces.
Methods
Dimensional Stacking
Worlds-within-Worlds
Tree-Map
Cone Trees
InfoCube


Dimensional Stacking

[Figure: dimensional stacking layout with axes labeled attribute 1, attribute 2, attribute 3, and attribute 4]
Partitioning of the n-dimensional attribute space in 2-D
subspaces, which are stacked into each other
Partitioning of the attribute value ranges into classes.
The important attributes should be used on the outer
levels.
Adequate for data with ordinal attributes of low cardinality
But, difficult to display more than nine dimensions
Important to map dimensions appropriately
Dimensional Stacking

Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-
axes and ore grade and depth mapped to the inner x-, y-axes
Tree-Map
Screen-filling method which uses a hierarchical
partitioning of the screen into regions depending on the
attribute values
The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
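The alternating partitioning can be sketched as a simple "slice-and-dice" layout. The tree format (name, size, children), the function name, and the example hierarchy are all assumptions for illustration; real tree-map tools use more elaborate layouts.

```python
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Lay out leaves of a (name, size, children) tree as rectangles.

    The region is split along x at even depths and along y at odd depths,
    with each child receiving a share proportional to its size.
    Returns a list of (name, x, y, width, height) tuples for the leaves.
    """
    if out is None:
        out = []
    name, size, children = node
    if not children:
        out.append((name, x, y, w, h))
        return out
    total = sum(child[1] for child in children)
    offset = 0.0
    for child in children:
        share = child[1] / total
        if depth % 2 == 0:   # partition the x-dimension
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:                # partition the y-dimension
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out

# Hypothetical hierarchy: a root with leaf A and an inner node B holding B1 and B2.
tree = ("root", 100, [
    ("A", 50, []),
    ("B", 50, [("B1", 25, []), ("B2", 25, [])]),
])
rects = slice_and_dice(tree, 0, 0, 100, 100)
for rect in rects:
    print(rect)
```

The leaf rectangles tile the whole screen region, and each rectangle's area is proportional to the size of its node.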

MSR Netscan Image

Tree-Map of a File System
(Shneiderman)

Three-D Cone Trees

3D cone tree visualization technique


works well for up to a thousand nodes
or so
First build a 2D circle tree that
arranges its nodes in concentric circles
centered on the root node
Cannot avoid overlaps when projected
to 2D
G. Robertson, J. Mackinlay, S. Card.
Cone Trees: Animated 3D
Visualizations of Hierarchical
Information, ACM SIGCHI'91
Graph from Nadeau Software
Consulting website: Visualize a social
network data set that models the way
an infection spreads from one person to
the next

InfoCube
A 3-D visualization technique where hierarchical
information is displayed as nested semi-
transparent cubes
The outermost cubes correspond to the top level
data, while the subnodes or the lower level data
are represented as smaller cubes inside the
outermost cubes, and so on

Source of Public Datasets

UC Irvine Machine Learning Repository


http://archive.ics.uci.edu/ml/
Datasets for Data Mining, The University of Edinburgh
http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
Google Public Data
http://www.google.com/publicdata/directory
Kent Ridge Bio-medical Dataset
http://datam.i2r.a-star.edu.sg/datasets/krbd/
Frequent Itemset Mining Dataset Repository
http://fimi.ua.ac.be/data/
Bioinformatics Datasets
http://www.kent.ac.uk/library/subjects/biosciences/bioinformatics.html?tab=genomes
Summary
Data attribute types: nominal, binary, ordinal,
interval-scaled, ratio-scaled
Many types of data sets, e.g., numerical, text,
graph, Web, image.
Gain insight into the data by:
Basic statistical data description: central tendency,
dispersion, graphical displays
Data visualization: map data onto graphical
primitives
Measure data similarity
Above steps are the beginning of data
preprocessing.
