
Data Preprocessing

 Data cleaning
 Data integration and transformation
 Data reduction
 Concept Hierarchy Generation
 Discretization
Data Reduction
 Data reduction
 Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
 Data reduction strategies:
 Dimensionality reduction
 Numerosity reduction
 Concept hierarchy generation
 Discretization
Dimensionality Reduction

 Feature selection (i.e., attribute subset selection):
 Reduce the number of attributes in the data
 Remove features with missing values
 Remove features with low variance
 Remove highly correlated features (see the sketch below)
 Univariate feature selection
 Heuristic methods:
 Decision-Tree induction
 PCA
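A minimal sketch of the variance and correlation filters named above, assuming a numeric pandas DataFrame `X`; the threshold values are illustrative, not prescribed by the slides:

```python
# Sketch of filter-based feature selection on a numeric pandas DataFrame X;
# var_min and corr_max are illustrative thresholds.
import pandas as pd

def filter_features(X: pd.DataFrame, var_min: float = 0.01,
                    corr_max: float = 0.95) -> pd.DataFrame:
    # 1. Drop features whose variance is below var_min.
    kept = X.loc[:, X.var() > var_min]
    # 2. Drop one feature from each highly correlated pair.
    corr = kept.corr().abs()
    cols, to_drop = list(corr.columns), set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in to_drop and b not in to_drop and corr.loc[a, b] > corr_max:
                to_drop.add(b)
    return kept.drop(columns=sorted(to_drop))
```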
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

              A4?
             /    \
           A1?    A6?
          /   \    /   \
   Class 1  Class 2  Class 1  Class 2

=> Reduced attribute set: {A1, A4, A6}
Data Compression
 String compression
 There are extensive theories and well-tuned
algorithms
 Typically lossless
 But only limited manipulation is possible without
expansion
 Audio/video compression
 Typically lossy compression, with progressive
refinement
 Sometimes small fragments of signal can be
reconstructed without reconstructing the whole

[Figure: original data can be compressed losslessly and restored exactly; lossy compression reconstructs only an approximation of the original data.]


Numerosity Reduction

 Parametric methods
 Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
 Example: Regression

 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
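Where the slide cites regression as a parametric method, a minimal sketch on synthetic data: the whole set of (x, y) pairs is replaced by two fitted parameters.

```python
# Sketch: parametric numerosity reduction via linear regression.
# Instead of storing all (x, y) pairs, store only slope and intercept.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(0.0, 2.0, size=1000)   # synthetic data

slope, intercept = np.polyfit(x, y, deg=1)   # the "reduced" representation
y_approx = slope * x + intercept             # reconstruct approximate values
```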
Histograms

 A popular data reduction technique
 Divide data into buckets and store the average or sum for each bucket/bin.

[Figure: histogram of prices from $10,000 to $90,000, with counts per bucket.]
Histograms
Type of Bucket:
 Singleton: Each bucket represents one
price-value/frequency pair.
 Equiwidth: The width of each bucket range
is uniform.
 Equidepth: The buckets are created so that,
roughly, the frequency of each bucket is
constant.
 MaxDiff: A bucket boundary is established between each pair of adjacent values whose difference is among the β - 1 largest, where β is user-specified.
 Etc.
Histograms

[Figure: singleton buckets vs. equi-width buckets.]

Histograms

Example:
The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been
sorted.
1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15,
15, 15, 18, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 21, 21, 21,
21, 22, 22, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30

How are the buckets determined and the attribute values partitioned?
By:
a. Equi-width (Equal Width) Histogram, Number of buckets = 5
b. Equi-depth (Equal Depth) Histogram, Number of buckets = 5
c. MaxDiff Histogram, β = 3

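A sketch of parts (a) and (b) with numpy, plus a MaxDiff variant following the β - 1 boundary rule stated earlier; only the edge computation is shown.

```python
# Sketch: bucket boundaries for the sorted AllElectronics prices above.
import numpy as np

prices = np.array([1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14,
                   14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 19, 19,
                   19, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 25, 25,
                   25, 25, 25, 28, 28, 30, 30, 30])

# (a) Equi-width: 5 buckets of equal range width over [min, max].
width_edges = np.linspace(prices.min(), prices.max(), 6)

# (b) Equi-depth: edges at every 20th percentile, so each bucket holds
# roughly the same number of values.
depth_edges = np.percentile(prices, [0, 20, 40, 60, 80, 100])

# (c) MaxDiff with beta = 3: place beta - 1 = 2 boundaries at the largest
# gaps between adjacent (already sorted) values; ties are broken arbitrarily.
gaps = np.diff(prices)
cut_at = np.sort(np.argsort(gaps)[-2:])
maxdiff_buckets = np.split(prices, cut_at + 1)   # 3 buckets
```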


Sampling
 Allow a mining algorithm to run in complexity
that is potentially sub-linear to the size of the
data
 Choose a representative subset of the data
 Simple random sampling may have very poor performance in the presence of skew
 Develop adaptive sampling methods
 Stratified sampling:
 Approximate the percentage of each class (or subpopulation of interest) in the overall database
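A minimal sketch of SRSWOR, SRSWR, and stratified sampling with pandas; the DataFrame `df` and its class column name "label" are assumptions for illustration.

```python
# Sketch: simple random and stratified sampling with pandas.
import pandas as pd

def srswor(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    # Simple random sample WITHOUT replacement.
    return df.sample(n=n, replace=False, random_state=seed)

def srswr(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    # Simple random sample WITH replacement.
    return df.sample(n=n, replace=True, random_state=seed)

def stratified(df: pd.DataFrame, frac: float, by: str = "label",
               seed: int = 0) -> pd.DataFrame:
    # Draw the same fraction from every stratum, so class proportions
    # in the sample approximate those of the full data set.
    return (df.groupby(by, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))
```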
Simple Random Sampling

[Figure: from the raw data, SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement).]
Stratified Sampling



Clustering

 Partition the data set into clusters; store only the cluster representations
 Can be very effective if data is clustered, but not if data is "smeared"
 Can have hierarchical clustering and be stored in multi-dimensional index tree structures
 There are many choices of clustering definitions and clustering algorithms.
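A minimal sketch of cluster-based reduction with scikit-learn's KMeans on synthetic data; only the centroids and per-cluster counts are kept as the reduced representation.

```python
# Sketch: numerosity reduction by clustering; the reduced representation
# is one centroid (plus a count) per cluster instead of all points.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(1000, 2))   # synthetic data
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

centroids = km.cluster_centers_    # 10 x 2 instead of 1000 x 2
counts = np.bincount(km.labels_)   # how many points each centroid stands for
```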
Cluster Sampling

[Figure: a cluster sample drawn from the raw data.]


Hierarchical Reduction
 Hierarchical clustering is often
performed.
 Parametric methods are usually not
amenable to hierarchical
representation
 Hierarchical aggregation
 An index tree hierarchically divides a data set
into partitions by value range of some attributes.
 Each partition can be considered as a bucket.
Hierarchical Reduction

A concept hierarchy for the attribute price.

Data Preprocessing

 Data cleaning
 Data integration and transformation
 Data reduction
 Concept Hierarchy Generation
 Discretization

Concept Hierarchy and Discretization
 Concept hierarchies
 reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior).
 Discretization
 reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can then be
used to replace actual data values.
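A minimal sketch of the age example with pandas; the cut points (29, 59) are illustrative assumptions, not values prescribed by the slides.

```python
# Sketch: replace numeric ages with higher-level concept labels.
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 21, 25, 30, 35, 42, 50, 61, 70])
labels = pd.cut(ages, bins=[0, 29, 59, 120],
                labels=["young", "middle-aged", "senior"])
# e.g. 21 -> "young", 42 -> "middle-aged", 70 -> "senior"
```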
Discretization and concept hierarchy
generation for numeric data

 Binning

 Histogram analysis

 Clustering analysis

 Concept hierarchy generation

 Entropy-based discretization

 Segmentation by natural partitioning


Specification of a set of attributes
Concept hierarchy can be automatically generated
based on the number of distinct values per attribute in
the given attribute set. The attribute with the most
distinct values is placed at the lowest level of the
hierarchy.
 country: 15 distinct values
 province_or_state: 65 distinct values
 city: 3,567 distinct values
 street: 674,339 distinct values
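A sketch of this heuristic: sort the attributes by their distinct-value counts, fewest at the top of the hierarchy. The column names are illustrative.

```python
# Sketch: derive a concept-hierarchy ordering from distinct-value counts.
import pandas as pd

def hierarchy_order(df: pd.DataFrame, attrs: list[str]) -> list[str]:
    # Fewest distinct values -> highest level of the hierarchy.
    return sorted(attrs, key=lambda a: df[a].nunique())

# e.g. hierarchy_order(df, ["street", "city", "province_or_state", "country"])
# -> ["country", "province_or_state", "city", "street"]
```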
Concept hierarchy generation for
categorical data (Example)
A concept hierarchy for the location.

Concept hierarchy generation for
categorical data (Example)
A concept hierarchy for the location, based on language.

Discretization
 Four types of attributes:
 Nominal
 Ordinal
 Interval
 Ratio
 Discretization:
 divide the range of a continuous attribute into
intervals
 Some classification algorithms only accept categorical
attributes.
 Reduce data size
 Prepare for further analysis

Four types of attributes:

Attribute Types
 Nominal:
 This scale is made up of the list of possible values that a variable may take.
 The order of these values has no meaning.
 Ordinal:
 This scale describes a variable whose values are ordered.
 The difference between the values does not describe the magnitude of the actual difference.
 Interval:
 Scales that describe values where the interval between the values has meaning.
 Ratio:
 Scales like interval, but where a doubling, tripling, etc. of the values implies a doubling, tripling, etc. of the measurement.

Types of Attributes
 Binary: true / false, yes/no, +/-, etc.
 Nominal: ID number, eye color, zip codes, etc.
 Ordinal: rankings (e.g., taste of potato chips
on a scale from 1-10), grades, height in {tall,
medium, short}, etc.
 Interval: calendar dates, temperatures in Celsius or Fahrenheit, age, etc. (e.g., on the Fahrenheit scale, 5°F, 10°F, 15°F: 10°F is not twice as hot as 5°F)
 Ratio: monetary quantities such as a bank account balance (e.g., $5, $10, $15: $10 is twice as much as $5)
Attribute Types

Methods for splitting the records
 Depends on attribute types
 Binary: true/false, yes/no, +/-, etc.
 Nominal: ID number, eye color, zip codes, etc.
 Ordinal: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}, etc.
 Continuous/Ratio: calendar dates, temperatures in Celsius or Fahrenheit, age, etc.
 Depends on number of ways to split
 2-way split (binary split)
 Multi-way split
Splitting based on Nominal attributes

 Each partition has a subset of values signifying it
 Multi-way split: Use as many partitions as distinct values.
 Binary split: Divides values into two subsets. Need to find the optimal partitioning (see the sketch below).
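A sketch of enumerating the candidate binary splits of a nominal attribute: every non-empty proper subset against its complement, with each split counted once.

```python
# Sketch: enumerate candidate binary splits of a nominal value set.
from itertools import combinations

def binary_splits(values):
    vals = sorted(values)
    for r in range(1, len(vals) // 2 + 1):
        for left in combinations(vals, r):
            right = tuple(v for v in vals if v not in left)
            # When the split is exactly half/half, each partition appears
            # twice (mirrored); keep only one of the pair.
            if 2 * r == len(vals) and left > right:
                continue
            yield set(left), set(right)

# e.g. a 3-value attribute yields 2**(3-1) - 1 = 3 candidate splits:
# list(binary_splits(["family", "luxury", "sports"]))
```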
Splitting based on Ordinal attributes
 Multi-way split:
 Use as many partitions as distinct values
 Binary split:
 Divides values into two subsets
 Need to find the optimal partitioning
 Must preserve the order property among attribute values
Splitting based on Continuous attributes
 Different ways of handling
 Discretization to form an ordinal categorical attribute
 Static – discretize once at the beginning
 Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), clustering, etc.
 Binary decision: (A < v) or (A ≥ v)
 Consider all possible splits and find the best cut (see the sketch below)
 Can be more compute-intensive
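A sketch of the exhaustive binary-decision search: score every candidate cut point v by weighted Gini impurity and keep the best (A < v) vs (A ≥ v) split. The impurity measure is an assumption; entropy works the same way.

```python
# Sketch: exhaustive search for the best binary cut on a continuous attribute.
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_cut(a, y):
    a, y = np.asarray(a), np.asarray(y)
    order = np.argsort(a)
    a, y = a[order], y[order]
    best_v, best_score = None, np.inf
    # Candidate cuts: midpoints between consecutive distinct values.
    for i in range(1, len(a)):
        if a[i] == a[i - 1]:
            continue
        v = (a[i] + a[i - 1]) / 2.0
        left, right = y[:i], y[i:]
        score = (i * gini(left) + (len(a) - i) * gini(right)) / len(a)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score
```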


Segmentation by natural partitioning

 The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals (see the sketch below).
 * If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into three equal-width intervals (for 3, 6, and 9), or three intervals grouped 2-3-2 (for 7)
 * If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into four intervals
 * If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into five intervals
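A sketch of the top-level 3-4-5 partition, assuming low and high have already been rounded at the most significant digit (msd), as in the worked example that follows:

```python
# Sketch: top-level 3-4-5 partition of the interval (low, high].
def three_four_five(low, high, msd):
    n = round((high - low) / msd)        # distinct values at the msd
    if n == 7:
        # 2-3-2 grouping for 7 distinct values.
        w = (high - low) / 7
        cuts = [low, low + 2 * w, low + 5 * w, high]
        return list(zip(cuts[:-1], cuts[1:]))
    if n in (3, 6, 9):
        k = 3
    elif n in (2, 4, 8):
        k = 4
    elif n in (1, 5, 10):
        k = 5
    else:
        k = n  # fallback: one interval per distinct value
    w = (high - low) / k
    return [(low + i * w, low + (i + 1) * w) for i in range(k)]

# e.g. three_four_five(-1_000_000, 2_000_000, 1_000_000) gives the three
# equal-width intervals (-1,000,000...0], (0...1,000,000], (1,000,000...2,000,000]
```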
Example of 3-4-5 rule
 Suppose that profits at different branches of AllElectronics for the
year 1997 cover a wide range, from -$351,976.00 to
$4,700,896.50.
 A user wishes to have a concept hierarchy for profit
automatically generated.
 For improved readability, we use the notation (l…r] to denote the interval that is open at l and closed at r. For example, (-$1,000,000…$0] denotes the range from -$1,000,000 (exclusive) to $0 (inclusive).
 Suppose that the data within the 5%-tile and 95%-tile are
between -$159,876 and $1,838,761. The results of applying the
3-4-5 rule are shown in the next slide

Example of 3-4-5 rule (5 Steps)
[Figure: the five steps of the 3-4-5 rule applied to the profit data, from Min/Low/High/Max (Step 1) down to the full interval hierarchy (Step 5); each step is worked through on the following slides.]
Example of 3-4-5 rule
Step 1: Based on the information, the minimum and maximum values are:
MIN = -$351,976.00, and MAX = $4,700,896.50. The low (5%-tile)
and high (95%-tile) values to be considered for the top or first level of
segmentation are: LOW = -$159,876.00 and HIGH = $1,838,761.00
[Figure, Step 1: Min = -$351, Low (5%-tile) = -$159, High (95%-tile) = $1,838, Max = $4,700 (profit, in thousands).]
Step 2: Given LOW and HIGH, the most significant digit is at the million dollar
digit position (i.e., msd = 1,000,000). Rounding LOW down to the
million dollar digit, we get LOW = -$1,000,000 and rounding HIGH
up to the million dollar digit, we get HIGH = +$2,000,000.

[Figure, Step 2: msd = 1,000, Low = -$1,000, High = $2,000 (in thousands).]
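A sketch of the Step 2 rounding, assuming the msd is taken from the larger magnitude of LOW and HIGH:

```python
# Sketch: round LOW down and HIGH up at the most significant digit.
import math

def round_at_msd(low, high):
    msd = 10 ** int(math.floor(math.log10(max(abs(low), abs(high)))))
    return msd, math.floor(low / msd) * msd, math.ceil(high / msd) * msd

# round_at_msd(-159_876, 1_838_761) -> (1_000_000, -1_000_000, 2_000_000)
```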
Example of 3-4-5 rule

Step 3: Since this interval ranges over 3 distinct values at the most significant
digit, i.e., (2,000,000 - (-1,000,000)) / 1,000,000 = 3, the segment is
partitioned into 3 equi-width subsegments according to the 3-4-5 rule:
(-$1,000,000…$0], ($0…$1,000,000], and ($1,000,000…$2,000,000].
This represents the top tier of the hierarchy.

Value = (High - Low) / msd

[Figure, Step 3: (-$1,000…$2,000] splits into (-$1,000…$0], ($0…$1,000], ($1,000…$2,000] (in thousands).]
Example of 3-4-5 rule
Step 4: We now examine the MIN and MAX values to see how they “fit"
into the first level partitions.
• Since the first interval, (-$1,000,000…$0] covers the MIN value, i.e.,
LOW < MIN, we can adjust the left boundary of this interval to make
the interval smaller. The most significant digit of MIN is at the hundred-thousand dollar position (msd = 100,000). Rounding MIN down to this position gives -$400,000.
Therefore, the first interval is redefined as (-$400,000…$0].
• Since the last interval, ($1,000,000…$2,000,000] does not cover the
MAX value, i.e., MAX > HIGH, we need to create a new interval to
cover it. Rounding up MAX at its most significant digit position, the new
interval is ($2,000,000…$5,000,000].
Hence, the top most level of the hierarchy contains four partitions,
(-$400,000…$0], ($0…$1,000,000], ($1,000,000…$2,000,000], ($2,000,000…
$5,000,000]

[Figure, Step 4: the top level becomes (-$400…$0], ($0…$1,000], ($1,000…$2,000], ($2,000…$5,000] (in thousands).]
Example of 3-4-5 rule
Step 5: Recursively, each interval can be further partitioned according to the 3-4-5 rule to form the next lower
level of the hierarchy:
• The first interval, (-$400,000…$0], is partitioned into 4 sub-intervals: (-$400,000…-$300,000],
(-$300,000…-$200,000], (-$200,000…-$100,000], and (-$100,000…$0].
• The second interval, ($0…$1,000,000], is partitioned into 5 sub-intervals: ($0…$200,000],
($200,000… $400,000], ($400,000…$600,000], ($600,000…$800,000], and ($800,000…$1,000,000].
• The third interval, ($1,000,000…$2,000,000], is partitioned into 5 sub-intervals: ($1,000,000…
$1,200,000], ($1,200,000…$1,400,000], ($1,400,000…$1,600,000], ($1,600,000…$1,800,000], and
($1,800,000…$2,000,000].
• The last interval, ($2,000,000…$5,000,000], is partitioned into 3 sub-intervals: ($2,000,000…
$3,000,000], ($3,000,000…$4,000,000], and ($4,000,000…$5,000,000].

[Figure, Step 5: the full hierarchy, with each top-level interval subdivided as listed above (in thousands).]
Similarly, the 3-4-5 rule can be applied iteratively at deeper levels, as necessary.
