
Data Preprocessing

■ Data cleaning
■ Data integration and transformation
■ Data reduction
■ Concept Hierarchy Generation
■ Discretization
Data Reduction
■ Data reduction
⚪ Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
■ Data reduction strategies
⚪ Dimensionality reduction
⚪ Numerosity reduction
⚪ Concept hierarchy generation
⚪ Discretization
Dimensionality Reduction

■ Feature selection (i.e., attribute subset selection):
⚪ Reduce the number of attributes (dimensions) in the data
⚪ Remove features with missing values
⚪ Remove features with low variance
⚪ Remove highly correlated features
⚪ Univariate feature selection (see the sketch below)
■ Heuristic methods:
⚪ Decision-Tree induction
⚪ PCA
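
A minimal sketch of these filter-style selection steps, assuming scikit-learn and pandas are available; the thresholds (0.01 variance, |r| > 0.95 correlation, k = 10) are illustrative choices, not values prescribed by the slide.

```python
# Hypothetical filter-style feature selection; thresholds are illustrative.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

def select_features(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    # Remove features with missing values
    X = X.dropna(axis=1)

    # Remove features with (near-)zero variance
    vt = VarianceThreshold(threshold=0.01).fit(X)
    X = X.loc[:, vt.get_support()]

    # Remove one feature from each highly correlated pair (|r| > 0.95)
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

    # Univariate selection: keep the k features most related to the class
    kb = SelectKBest(f_classif, k=min(10, X.shape[1])).fit(X, y)
    return X.loc[:, kb.get_support()]
```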
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

                 A4?
                /    \
             A1?      A6?
            /   \     /   \
      Class 1 Class 2 Class 1 Class 2

Reduced attribute set: {A1, A4, A6}


Data Compression
■ String compression
⚪ There are extensive theories and well-tuned algorithms
⚪ Typically lossless (see the round-trip sketch below)
⚪ But only limited manipulation is possible without expansion
■ Audio/video compression
⚪ Typically lossy compression, with progressive
refinement
⚪ Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
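
As a tiny illustration of lossless string compression (here Python's standard-library zlib; any DEFLATE-style codec behaves similarly), a compress/decompress round trip reproduces the data exactly, but the compressed bytes must be expanded before they can be edited:

```python
import zlib

text = b"AllElectronics " * 100                 # highly redundant string
compressed = zlib.compress(text)                # lossless DEFLATE compression
assert zlib.decompress(compressed) == text      # exact reconstruction
print(len(text), "->", len(compressed), "bytes")
# Editing the data requires decompressing first: only limited
# manipulation is possible on the compressed form itself.
```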



Data Compression

[Figure: original data compressed losslessly can be fully restored; lossy compression yields only an approximation of the original data.]


Numerosity Reduction

■ Parametric methods
⚪ Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
⚪ Example: Regression (see the sketch below)

■ Non-parametric methods
⚪ Do not assume models
⚪ Major families: histograms, clustering, sampling
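
A minimal sketch of the parametric idea, assuming an approximately linear relationship: fit y ≈ wx + b, store only the two parameters (plus any points the model explains poorly), and discard the raw data.

```python
# Sketch only: assumes the attribute relationship is roughly linear.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 10_000)
y = 3.0 * x + 5.0 + rng.normal(0.0, 2.0, x.size)    # near-linear data

w, b = np.polyfit(x, y, deg=1)     # store only 2 parameters, not 10,000 points

# Keep the few points the model explains poorly (possible outliers)
residuals = np.abs(y - (w * x + b))
outliers = np.column_stack([x, y])[residuals > 3 * residuals.std()]
print(f"stored: 2 parameters + {len(outliers)} outliers instead of {x.size} points")
```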
Histograms

■ A popular data reduction technique
■ Divide data into buckets and store the average or sum for each bucket/bin



Histograms
Types of buckets:
■ Singleton: Each bucket represents one
price-value/frequency pair.
■ Equiwidth: The width of each bucket range is
uniform.
■ Equidepth: The buckets are created so that,
roughly, the frequency of each bucket is
constant.
■ MaxDiff: A bucket boundary is established between each pair of adjacent values for the β − 1 pairs having the largest differences, where β (the number of buckets) is user-specified.
■ Etc.
Histograms

[Figure: singleton buckets vs. equi-width buckets for the same data]



Histograms

Example:
The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been
sorted.
1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22,
22, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30

How are the buckets determined and the attribute values partitioned?
By:
a. Equi-width (Equal Width) Histogram, Number of buckets = 5
b. Equi-depth (Equal Depth) Histogram, Number of buckets = 5
c. MaxDiff Histogram, β = 3 (a computational sketch follows below)
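
A sketch of how the three bucketings could be computed (numpy assumed); for MaxDiff, the slide's definition is applied directly by cutting at the β − 1 = 2 largest gaps between adjacent sorted values.

```python
import numpy as np

prices = np.array([1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                   15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 19, 19, 19, 20,
                   20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 25, 25, 25, 25, 25,
                   28, 28, 30, 30, 30])

# (a) Equi-width: 5 buckets spanning equal ranges of the price axis
counts, edges = np.histogram(prices, bins=5)
print("equi-width edges:", edges, "counts:", counts)

# (b) Equi-depth: 5 buckets holding (roughly) equal frequencies, via quantiles
print("equi-depth edges:", np.quantile(prices, np.linspace(0, 1, 6)))

# (c) MaxDiff, beta = 3: boundaries at the beta-1 = 2 largest adjacent gaps
values = np.unique(prices)
gaps = np.diff(values)                         # gaps between adjacent values
cuts = values[np.sort(np.argsort(gaps)[-2:])]  # left value of each chosen gap
print("MaxDiff boundaries after values:", cuts)  # ties are broken arbitrarily
```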



Sampling
■ Allow a mining algorithm to run in complexity
that is potentially sub-linear to the size of the
data
■ Choose a representative subset of the data
⚪ Simple random sampling may have very poor performance in the presence of skew
■ Develop adaptive sampling methods
⚪ Stratified Sampling
■ Approximate the percentage of each class (or subpopulation of interest) in the overall database (see the sketch below)
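
A minimal sketch of the three schemes with pandas (assumed available); the `strata` column and the 90/10 split are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "strata": ["A"] * 900 + ["B"] * 100})   # skewed classes

srswor = df.sample(n=100, replace=False, random_state=0)   # SRS without replacement
srswr = df.sample(n=100, replace=True, random_state=0)     # SRS with replacement

# Stratified: sample each class so the 90/10 proportions are preserved
stratified = df.groupby("strata").sample(frac=0.1, random_state=0)
print(stratified["strata"].value_counts())                 # 90 A, 10 B
```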
Simple Random Sampling

[Figure: raw data reduced by SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement)]
Stratified Sampling

[Figure: raw data and the resulting stratified sample]


Clustering

■ Partition the data set into clusters, and store only the cluster representation (see the sketch below)
■ Can be very effective if data is clustered but not if
data is “smeared”
■ Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
■ There are many choices of clustering definitions and
clustering algorithms.
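
A sketch of the idea using k-means (scikit-learn assumed): each point is replaced by its cluster centroid, so only k centroids plus one label per point need to be stored.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.5, size=(500, 2)) for c in (0.0, 5.0, 10.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Reduced representation: 3 centroids (plus one small label per point)
approx = km.cluster_centers_[km.labels_]   # reconstruct each point from its centroid
err = np.linalg.norm(data - approx, axis=1).mean()
print(f"{data.shape[0]} points -> {km.n_clusters} centroids, mean error {err:.3f}")
```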
Cluster Sampling

[Figure: raw data grouped into clusters; a cluster sample keeps only a subset of the clusters]



Hierarchical Reduction
■ Hierarchical clustering is often
performed.
■ Parametric methods are usually not
amenable to hierarchical
representation
■ Hierarchical aggregation
⚪ An index tree hierarchically divides a data set
into partitions by value range of some attributes.
⚪ Each partition can be considered as a bucket.
Hierarchical Reduction

A concept hierarchy for the attribute price.



Data Preprocessing

■ Data cleaning
■ Data integration and transformation
■ Data reduction
■ Concept Hierarchy Generation
■ Discretization



Concept Hierarchy and Discretization
■ Concept hierarchies
⚪ reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior).
■ Discretization
⚪ reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can then be used to replace actual data values (see the sketch below).
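
A minimal sketch of both ideas on an age attribute with pandas (assumed); the cut points 30 and 60 and the labels are illustrative.

```python
import pandas as pd

ages = pd.Series([13, 15, 22, 25, 33, 35, 45, 46, 52, 70])

# Discretization: replace numeric values with interval labels
intervals = pd.cut(ages, bins=[0, 30, 60, 120])
print(intervals.value_counts())

# Concept hierarchy: replace low-level values with higher-level concepts
concepts = pd.cut(ages, bins=[0, 30, 60, 120],
                  labels=["young", "middle-aged", "senior"])
print(concepts.tolist())
```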
Discretization and concept hierarchy
generation for numeric data
■ Binning

■ Histogram analysis

■ Clustering analysis

■ Concept hierarchy generation

■ Entropy-based discretization

■ Segmentation by natural partitioning


Specification of a set of
attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy (see the sketch below).
country: 15 distinct values
province or state: 65 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
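
A sketch of this heuristic with pandas (assumed); the toy table is hypothetical. Counting distinct values and sorting descending yields the hierarchy from lowest level to highest.

```python
import pandas as pd

# Hypothetical location table
df = pd.DataFrame({
    "country":           ["Canada", "Canada", "Canada", "USA", "USA"],
    "province_or_state": ["BC", "BC", "ON", "NY", "NY"],
    "city":              ["Vancouver", "Vancouver", "Toronto", "New York", "Buffalo"],
    "street":            ["Main St", "Oak St", "King St", "5th Ave", "Elm St"],
})

# Most distinct values -> lowest level of the hierarchy
order = df.nunique().sort_values(ascending=False)
print(order)   # street: 5, city: 4, province_or_state: 3, country: 2
print("lowest -> highest:", " -> ".join(order.index))
```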


Concept hierarchy generation for
categorical data (Example)
A concept hierarchy for the attribute location.



Concept hierarchy generation for
categorical data (Example)
A concept hierarchy for the attribute location, based on language.



Discretization

■ Four types of attributes:
⚪ Nominal
⚪ Ordinal
⚪ Interval
⚪ Ratio
■ Discretization:
▪ Divide the range of a continuous attribute into intervals
■ Why discretize?
⚪ Some classification algorithms only accept categorical attributes
⚪ Reduce data size
⚪ Prepare for further analysis



Four types of attributes:



Attribute Types
■ Nominal:
⚪ This scale is made up of the list of possible values that a variable
may take.
⚪ The order of these values has no meaning.
■ Ordinal:
⚪ This scale describes a variable whose values are ordered.
⚪ The difference between the values does not describe the magnitude of the actual difference.
■ Interval:
⚪ Scales that describe values where the interval between the values
has meaning.
■ Ratio:
⚪ Scales that describe variables where the same difference between values has the same meaning as in interval, but where a doubling, tripling, etc. of the value implies a doubling, tripling, etc. of the measurement.



Types of Attributes
■ Binary: true / false, yes/no, +/-, etc.
■ Nominal: ID number, eye color, zip codes, etc.
■ Ordinal: rankings (e.g., taste of potato chips
on a scale from 1-10), grades, height in {tall,
medium, short}, etc.
■ Interval: calendar dates, temperatures in Celsius or Fahrenheit, age, etc. (e.g., Fahrenheit scale: 5°F, 10°F, 15°F; 10°F is not twice as hot as 5°F.)
■ Ratio: bank account balance, tax paid, etc. (e.g., balances of $5, $10, $15: $10 is twice as much as $5.)
Attribute Types



Concept hierarchy generation for
categorical data

■ Specification of a partial ordering of attributes explicitly at the schema level by users or experts
■ Specification of a portion of a hierarchy by explicit data grouping
■ Specification of only a partial set of attributes
Methods for splitting the records
■ Depends on attribute types
⚪ Binary: true / false, yes/no, +/-, etc.
⚪ Nominal: ID number, eye color, zip codes, etc.
⚪ Ordinal: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}, etc.
⚪ Continuous/Ratio: calendar dates, temperatures
in Celsius or Fahrenheit, age, etc.
■ Depends on number of ways to split
⚪ 2-way split (Binary split)
⚪ Multi-way split
Splitting based on Nominal
attributes

■ Each partition has a subset of values signifying it
⚪ Multi-way split: Use as many partitions as distinct values.
⚪ Binary split: Divides values into two subsets. Need to find optimal partitioning.
Splitting based on Ordinal
attributes
■ Multi-way split:
⚪ Use as many partitions as distinct values
■ Binary split:
⚪ Divides values into two subsets
⚪ Need to find optimal partitioning
⚪ Preserve order property among attribute values
Splitting based on Continuous
attributes
■ Different ways of handling
⚪ Discretization to form an ordinal categorical attribute
■ Static: discretize once at the beginning
■ Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), clustering, etc.
⚪ Binary decision: (A < v) or (A ≥ v)
■ Consider all possible splits and find the best cut (see the sketch below)
■ Can be more computationally intensive
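
A minimal sketch of the binary-decision search, assuming a Gini-impurity criterion (the slide does not fix one): sort the attribute, try each midpoint as a candidate cut v, and keep the split (A < v) / (A ≥ v) with the lowest weighted impurity.

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_binary_split(a: np.ndarray, y: np.ndarray):
    """Find the cut v minimising the weighted Gini impurity of (A < v, A >= v)."""
    order = np.argsort(a)
    a, y = a[order], y[order]
    best_v, best_imp = None, float("inf")
    for v in (a[:-1] + a[1:]) / 2:           # midpoints as candidate cuts
        left, right = y[a < v], y[a >= v]
        if len(left) == 0 or len(right) == 0:
            continue
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_v, best_imp = v, imp
    return best_v, best_imp

age = np.array([22, 25, 30, 35, 40, 45, 50, 60])
buys = np.array([0, 0, 0, 1, 1, 1, 1, 0])
print(best_binary_split(age, buys))          # -> (32.5, 0.2)
```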



Segmentation by natural
partitioning

The 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals (see the sketch below).
* If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into three intervals (equi-width for 3, 6, and 9; in the grouping 2-3-2 for 7)
* If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into four intervals
* If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into five intervals
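
A sketch of one level of the rule (the helper name and signature are hypothetical): count the distinct values at the most significant digit and choose the interval grouping accordingly.

```python
def partition_3_4_5(low, high, msd):
    """One level of the 3-4-5 rule for the msd-aligned range (low, high].
    Hypothetical helper; name and signature are illustrative."""
    n = round((high - low) / msd)          # distinct values at the msd position
    if n == 7:
        widths = [2, 3, 2]                 # special 2-3-2 grouping
    elif n in (3, 6, 9):
        widths = [n / 3] * 3               # three equi-width intervals
    elif n in (2, 4, 8):
        widths = [n / 4] * 4               # four equi-width intervals
    elif n in (1, 5, 10):
        widths = [n / 5] * 5               # five equi-width intervals
    else:
        raise ValueError(f"3-4-5 rule does not cover n = {n}")
    intervals, left = [], low
    for w in widths:
        intervals.append((left, left + w * msd))
        left += w * msd
    return intervals

# Example: a range covering 3 distinct values at msd = 1,000,000
print(partition_3_4_5(-1_000_000, 2_000_000, 1_000_000))
# -> [(-1000000, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]
```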
Example of 3-4-5 rule
■ Suppose that profits at different branches of AllElectronics for the
year 1997 cover a wide range, from -$351,976.00 to
$4,700,896.50.
■ A user wishes to have a concept hierarchy for profit automatically
generated.
⚪ For improved readability, we use the notation (l…r] to represent the interval from l (exclusive) to r (inclusive). For example, (-$1,000,000…$0] denotes the range from -$1,000,000 (exclusive) to $0 (inclusive).
■ Suppose that the data within the 5%-tile and 95%-tile are
between -$159,876 and $1,838,761. The results of applying the
3-4-5 rule are shown in the next slide



Example of 3-4-5 rule (5 Steps)
[Figure: overview of the five steps of the 3-4-5 rule applied to the AllElectronics profit data. Step 1: Min = -$351,976, Low (5%-tile) = -$159,876, High (95%-tile) = $1,838,761, Max = $4,700,896.50. Step 2: msd = $1,000,000, LOW = -$1,000,000, HIGH = $2,000,000. Steps 3-5 build the interval hierarchy detailed on the following slides.]
Example of 3-4-5 rule
Step 1: Based on the information, the minimum and maximum values are: MIN
= -$351,976.00, and MAX = $4,700,896.50. The low (5%-tile) and
high (95%-tile) values to be considered for the top or first level of
segmentation are: LOW = -$159,876.00 and HIGH = $1,838,761.00
Step 2: Given LOW and HIGH, the most significant digit is at the million dollar
digit position (i.e., msd = 1,000,000). Rounding LOW down to the
million dollar digit, we get LOW = -$1,000,000 and rounding HIGH
up to the million dollar digit, we get HIGH = +$2,000,000.
Example of 3-4-5 rule

Step 3: Since this interval ranges over 3 distinct values at the most significant
digit, i.e., (2,000,000 - (-1,000,000)) / 1,000,000 = 3, the segment is
partitioned into 3 equi-width subsegments according to the 3-4-5 rule:
(-$1,000,000…$0], ($0…$1,000,000], and ($1,000,000…$2,000,000].
This represents the top tier of the hierarchy.

Step 3 (in $1,000s): (-$1,000…$2,000] → (-$1,000…$0], ($0…$1,000], ($1,000…$2,000]


Example of 3-4-5 rule
Step 4: We now examine the MIN and MAX values to see how they “fit” into the first level partitions.
• Since the first interval, (-$1,000,000…$0] covers the MIN value, i.e.,
LOW < MIN, we can adjust the left boundary of this interval to make
the interval smaller. The most significant digit (msd) of MIN = 100,000.
Rounding MIN down to this position, we get MIN = -$400,000.
Therefore, the first interval is redefined as (-$400,000…$0].
• Since the last interval, ($1,000,000…$2,000,000] does not cover the
MAX value, i.e., MAX > HIGH, we need to create a new interval to
cover it. Rounding up MAX at its most significant digit position, the new
interval is ($2,000,000…$5,000,000].
Hence, the top most level of the hierarchy contains four partitions,
(-$400,000…$0], ($0…$1,000,000], ($1,000,000…$2,000,000],
($2,000,000…$5,000,000]

Step 4 (in $1,000s): (-$400…$5,000] → (-$400…$0], ($0…$1,000], ($1,000…$2,000], ($2,000…$5,000]
Example of 3-4-5 rule
Step 5: Recursively, each interval can be further partitioned according to the 3-4-5 rule to form the next lower
level of the hierarchy:
• The first interval (-$400,000…$0] is partitioned into 4 sub-intervals: (-$400,000…-$300,000], (-$300,000…-$200,000], (-$200,000…-$100,000], and (-$100,000…$0].
• The second interval, ($0…$1,000,000], is partitioned into 5 sub-intervals: ($0…$200,000],
($200,000… $400,000], ($400,000…$600,000], ($600,000…$800,000], and ($800,000…$1,000,000].
• The third interval, ($1,000,000…$2,000,000], is partitioned into 5 sub-intervals: ($1,000,000…
$1,200,000], ($1,200,000…$1,400,000], ($1,400,000…$1,600,000], ($1,600,000…$1,800,000], and
($1,800,000…$2,000,000].
• The last interval, ($2,000,000…$5,000,000], is partitioned into 3 sub-intervals: ($2,000,000…
$3,000,000], ($3,000,000…$4,000,000], and ($4,000,000…$5,000,000].

Step 5 (in $1,000s):
(-$400…$0] → (-$400…-$300], (-$300…-$200], (-$200…-$100], (-$100…$0]
($0…$1,000] → ($0…$200], ($200…$400], ($400…$600], ($600…$800], ($800…$1,000]
($1,000…$2,000] → ($1,000…$1,200], ($1,200…$1,400], ($1,400…$1,600], ($1,600…$1,800], ($1,800…$2,000]
($2,000…$5,000] → ($2,000…$3,000], ($3,000…$4,000], ($4,000…$5,000]
Similarly, the 3-4-5 rule can be carried out iteratively at deeper levels, as necessary (the sketch below reproduces two of the step-5 partitions).
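
Continuing the hypothetical `partition_3_4_5` sketch from earlier, the same helper reproduces, for instance, the step-5 partitions of the first two intervals:

```python
# Step 5 via the sketch above, at msd = $100,000:
print(partition_3_4_5(-400_000, 0, 100_000))
# -> four sub-intervals: (-400000, -300000), ..., (-100000, 0)
print(partition_3_4_5(0, 1_000_000, 100_000))
# -> five sub-intervals of width 200,000: (0, 200000), ..., (800000, 1000000)
```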
