
Data Preprocessing

■ Data cleaning
■ Data integration and transformation
■ Data reduction
■ Concept Hierarchy Generation
■ Discretization
Data Reduction
■ Data reduction
⚪ Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
■ Data reduction strategies
⚪ Dimensionality reduction
⚪ Numerosity reduction
⚪ Concept hierarchy generation
⚪ Discretization
Dimensionality Reduction

■ Feature selection (i.e., attribute subset selection):
⚪ Reduce the number of attributes (dimensions) in the data
⚪ Remove features with missing values
⚪ Remove features with low variance
⚪ Remove highly correlated features
⚪ Univariate feature selection (see the sketch below)
■ Heuristic methods:
⚪ Decision-Tree induction
⚪ PCA
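
A minimal sketch of these filter-style selection steps, assuming scikit-learn and pandas are available; the thresholds (0.01 variance, |r| > 0.95 correlation, k = 10) are illustrative choices, not values prescribed by the slide.

```python
# Hypothetical filter-style feature selection; thresholds are illustrative.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

def select_features(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    # Remove features with missing values
    X = X.dropna(axis=1)

    # Remove features with (near-)zero variance
    vt = VarianceThreshold(threshold=0.01).fit(X)
    X = X.loc[:, vt.get_support()]

    # Remove one feature from each highly correlated pair (|r| > 0.95)
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

    # Univariate selection: keep the k features most related to the class
    kb = SelectKBest(f_classif, k=min(10, X.shape[1])).fit(X, y)
    return X.loc[:, kb.get_support()]
```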
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

                 A4?
                /    \
             A1?      A6?
            /   \     /   \
      Class 1 Class 2 Class 1 Class 2

Reduced attribute set: {A1, A4, A6}


Data Compression
■ String compression
⚪ There are extensive theories and well-tuned algorithms
⚪ Typically lossless (see the round-trip sketch below)
⚪ But only limited manipulation is possible without expansion
■ Audio/video compression
⚪ Typically lossy compression, with progressive
refinement
⚪ Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
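
As a tiny illustration of lossless string compression (here Python's standard-library zlib; any DEFLATE-style codec behaves similarly), a compress/decompress round trip reproduces the data exactly, but the compressed bytes must be expanded before they can be edited:

```python
import zlib

text = b"AllElectronics " * 100                 # highly redundant string
compressed = zlib.compress(text)                # lossless DEFLATE compression
assert zlib.decompress(compressed) == text      # exact reconstruction
print(len(text), "->", len(compressed), "bytes")
# Editing the data requires decompressing first: only limited
# manipulation is possible on the compressed form itself.
```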



Data Compression

[Figure: original data compressed losslessly can be fully restored; lossy compression yields only an approximation of the original data.]


Numerosity Reduction

■ Parametric methods
⚪ Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
⚪ Example: Regression (see the sketch below)

■ Non-parametric methods
⚪ Do not assume models
⚪ Major families: histograms, clustering, sampling
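
A minimal sketch of the parametric idea, assuming an approximately linear relationship: fit y ≈ wx + b, store only the two parameters (plus any points the model explains poorly), and discard the raw data.

```python
# Sketch only: assumes the attribute relationship is roughly linear.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 10_000)
y = 3.0 * x + 5.0 + rng.normal(0.0, 2.0, x.size)    # near-linear data

w, b = np.polyfit(x, y, deg=1)     # store only 2 parameters, not 10,000 points

# Keep the few points the model explains poorly (possible outliers)
residuals = np.abs(y - (w * x + b))
outliers = np.column_stack([x, y])[residuals > 3 * residuals.std()]
print(f"stored: 2 parameters + {len(outliers)} outliers instead of {x.size} points")
```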
Histograms

■ A popular data reduction technique
■ Divide data into buckets and store the average or sum for each bucket/bin



Histograms
Types of buckets:
■ Singleton: Each bucket represents one
price-value/frequency pair.
■ Equiwidth: The width of each bucket range is
uniform.
■ Equidepth: The buckets are created so that,
roughly, the frequency of each bucket is
constant.
■ MaxDiff: A bucket boundary is established between each pair of adjacent values for the β − 1 pairs having the largest differences, where β (the number of buckets) is user-specified.
■ Etc.
Histograms

[Figure: singleton buckets vs. equi-width buckets for the same data]



Histograms

Example:
The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been
sorted.
1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22,
22, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30

How are the buckets determined and the attribute values partitioned?
By:
a. Equi-width (Equal Width) Histogram, Number of buckets = 5
b. Equi-depth (Equal Depth) Histogram, Number of buckets = 5
c. MaxDiff Histogram, β = 3 (a computational sketch follows below)
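
A sketch of how the three bucketings could be computed (numpy assumed); for MaxDiff, the slide's definition is applied directly by cutting at the β − 1 = 2 largest gaps between adjacent sorted values.

```python
import numpy as np

prices = np.array([1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                   15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 19, 19, 19, 20,
                   20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 25, 25, 25, 25, 25,
                   28, 28, 30, 30, 30])

# (a) Equi-width: 5 buckets spanning equal ranges of the price axis
counts, edges = np.histogram(prices, bins=5)
print("equi-width edges:", edges, "counts:", counts)

# (b) Equi-depth: 5 buckets holding (roughly) equal frequencies, via quantiles
print("equi-depth edges:", np.quantile(prices, np.linspace(0, 1, 6)))

# (c) MaxDiff, beta = 3: boundaries at the beta-1 = 2 largest adjacent gaps
values = np.unique(prices)
gaps = np.diff(values)                         # gaps between adjacent values
cuts = values[np.sort(np.argsort(gaps)[-2:])]  # left value of each chosen gap
print("MaxDiff boundaries after values:", cuts)  # ties are broken arbitrarily
```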



Sampling
■ Allow a mining algorithm to run in complexity
that is potentially sub-linear to the size of the
data
■ Choose a representative subset of the data
⚪ Simple random sampling may have very poor performance in the presence of skew
■ Develop adaptive sampling methods
⚪ Stratified Sampling
■ Approximate the percentage of each class (or subpopulation of interest) in the overall database (see the sketch below)
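
A minimal sketch of the three schemes with pandas (assumed available); the `strata` column and the 90/10 split are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "strata": ["A"] * 900 + ["B"] * 100})   # skewed classes

srswor = df.sample(n=100, replace=False, random_state=0)   # SRS without replacement
srswr = df.sample(n=100, replace=True, random_state=0)     # SRS with replacement

# Stratified: sample each class so the 90/10 proportions are preserved
stratified = df.groupby("strata").sample(frac=0.1, random_state=0)
print(stratified["strata"].value_counts())                 # 90 A, 10 B
```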
Simple Random Sampling

[Figure: raw data reduced by SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement)]
Stratified Sampling

[Figure: raw data and the resulting stratified sample]


Clustering

■ Partition the data set into clusters, and store only the cluster representation (see the sketch below)
■ Can be very effective if data is clustered but not if
data is “smeared”
■ Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
■ There are many choices of clustering definitions and
clustering algorithms.
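
A sketch of the idea using k-means (scikit-learn assumed): each point is replaced by its cluster centroid, so only k centroids plus one label per point need to be stored.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.5, size=(500, 2)) for c in (0.0, 5.0, 10.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Reduced representation: 3 centroids (plus one small label per point)
approx = km.cluster_centers_[km.labels_]   # reconstruct each point from its centroid
err = np.linalg.norm(data - approx, axis=1).mean()
print(f"{data.shape[0]} points -> {km.n_clusters} centroids, mean error {err:.3f}")
```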
Cluster Sampling

[Figure: raw data grouped into clusters; a cluster sample keeps only a subset of the clusters]



Hierarchical Reduction
■ Hierarchical clustering is often
performed.
■ Parametric methods are usually not
amenable to hierarchical
representation
■ Hierarchical aggregation
⚪ An index tree hierarchically divides a data set
into partitions by value range of some attributes.
⚪ Each partition can be considered as a bucket.
Hierarchical Reduction

A concept hierarchy for the attribute price.



Data Preprocessing

■ Data cleaning
■ Data integration and transformation
■ Data reduction
■ Concept Hierarchy Generation
■ Discretization



Concept Hierarchy and Discretization
■ Concept hierarchies
⚪ reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior).
■ Discretization
⚪ reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can then be used to replace actual data values (see the sketch below).
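
A minimal sketch of both ideas on an age attribute with pandas (assumed); the cut points 30 and 60 and the labels are illustrative.

```python
import pandas as pd

ages = pd.Series([13, 15, 22, 25, 33, 35, 45, 46, 52, 70])

# Discretization: replace numeric values with interval labels
intervals = pd.cut(ages, bins=[0, 30, 60, 120])
print(intervals.value_counts())

# Concept hierarchy: replace low-level values with higher-level concepts
concepts = pd.cut(ages, bins=[0, 30, 60, 120],
                  labels=["young", "middle-aged", "senior"])
print(concepts.tolist())
```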
Discretization and concept hierarchy
generation for numeric data
■ Binning

■ Histogram analysis

■ Clustering analysis

■ Concept hierarchy generation

■ Entropy-based discretization

■ Segmentation by natural partitioning


Specification of a set of
attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy (see the sketch below).
country: 15 distinct values
province or state: 65 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
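
A sketch of this heuristic with pandas (assumed); the toy table is hypothetical. Counting distinct values and sorting descending yields the hierarchy from lowest level to highest.

```python
import pandas as pd

# Hypothetical location table
df = pd.DataFrame({
    "country":           ["Canada", "Canada", "Canada", "USA", "USA"],
    "province_or_state": ["BC", "BC", "ON", "NY", "NY"],
    "city":              ["Vancouver", "Vancouver", "Toronto", "New York", "Buffalo"],
    "street":            ["Main St", "Oak St", "King St", "5th Ave", "Elm St"],
})

# Most distinct values -> lowest level of the hierarchy
order = df.nunique().sort_values(ascending=False)
print(order)   # street: 5, city: 4, province_or_state: 3, country: 2
print("lowest -> highest:", " -> ".join(order.index))
```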


Concept hierarchy generation for
categorical data (Example)
A concept hierarchy for the attribute location.



Concept hierarchy generation for
categorical data (Example)
A concept hierarchy for the attribute location, based on language.



Discretization

■ Four types of attributes:
⚪ Nominal
⚪ Ordinal
⚪ Interval
⚪ Ratio
■ Discretization:
▪ Divide the range of a continuous attribute into intervals
■ Why discretize?
⚪ Some classification algorithms only accept categorical attributes
⚪ Reduce data size
⚪ Prepare for further analysis



Four types of attributes:



Attribute Types
■ Nominal:
⚪ This scale is made up of the list of possible values that a variable
may take.
⚪ The order of these values has no meaning.
■ Ordinal:
⚪ This scale describes a variable whose values are ordered.
⚪ The difference between the values does not describe the magnitude of the actual difference.
■ Interval:
⚪ Scales that describe values where the interval between the values
has meaning.
■ Ratio:
⚪ Scales that describe variables where the same difference between values has the same meaning as in interval, but where a doubling, tripling, etc. of the value implies a doubling, tripling, etc. of the measurement.



Types of Attributes
■ Binary: true / false, yes/no, +/-, etc.
■ Nominal: ID number, eye color, zip codes, etc.
■ Ordinal: rankings (e.g., taste of potato chips
on a scale from 1-10), grades, height in {tall,
medium, short}, etc.
■ Interval: calendar dates, temperatures in Celsius or Fahrenheit, age, etc. (e.g., Fahrenheit scale: 5°F, 10°F, 15°F; 10°F is not twice as hot as 5°F.)
■ Ratio: bank account balance, tax paid, etc. (e.g., balances of $5, $10, $15: $10 is twice as much as $5.)
Attribute Types



Concept hierarchy generation for
categorical data

■ Specification of a partial ordering of attributes explicitly at the schema level by users or experts
■ Specification of a portion of a hierarchy by explicit data grouping
■ Specification of only a partial set of attributes
Methods for splitting the records
■ Depends on attribute types
⚪ Binary: true / false, yes/no, +/-, etc.
⚪ Nominal: ID number, eye color, zip codes, etc.
⚪ Ordinal: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}, etc.
⚪ Continuous/Ratio: calendar dates, temperatures
in Celsius or Fahrenheit, age, etc.
■ Depends on number of ways to split
⚪ 2-way split (Binary split)
⚪ Multi-way split
Splitting based on Nominal
attributes

■ Each partition has a subset of values signifying it
⚪ Multi-way split: Use as many partitions as distinct values.
⚪ Binary split: Divides values into two subsets. Need to find optimal partitioning.
Splitting based on Ordinal
attributes
■ Multi-way split:
⚪ Use as many partitions as distinct values
■ Binary split:
⚪ Divides values into two subsets
⚪ Need to find optimal partitioning
⚪ Preserve order property among attribute values
Splitting based on Continuous
attributes
■ Different ways of handling
⚪ Discretization to form an ordinal categorical attribute
■ Static: discretize once at the beginning
■ Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), clustering, etc.
⚪ Binary decision: (A < v) or (A ≥ v)
■ Consider all possible splits and find the best cut (see the sketch below)
■ Can be more computationally intensive
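
A minimal sketch of the binary-decision search, assuming a Gini-impurity criterion (the slide does not fix one): sort the attribute, try each midpoint as a candidate cut v, and keep the split (A < v) / (A ≥ v) with the lowest weighted impurity.

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_binary_split(a: np.ndarray, y: np.ndarray):
    """Find the cut v minimising the weighted Gini impurity of (A < v, A >= v)."""
    order = np.argsort(a)
    a, y = a[order], y[order]
    best_v, best_imp = None, float("inf")
    for v in (a[:-1] + a[1:]) / 2:           # midpoints as candidate cuts
        left, right = y[a < v], y[a >= v]
        if len(left) == 0 or len(right) == 0:
            continue
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_v, best_imp = v, imp
    return best_v, best_imp

age = np.array([22, 25, 30, 35, 40, 45, 50, 60])
buys = np.array([0, 0, 0, 1, 1, 1, 1, 0])
print(best_binary_split(age, buys))          # -> (32.5, 0.2)
```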



Segmentation by natural
partitioning

The 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals (see the sketch below).
* If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into three intervals (equi-width for 3, 6, and 9; in the grouping 2-3-2 for 7)
* If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into four intervals
* If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into five intervals
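
A sketch of one level of the rule (the helper name and signature are hypothetical): count the distinct values at the most significant digit and choose the interval grouping accordingly.

```python
def partition_3_4_5(low, high, msd):
    """One level of the 3-4-5 rule for the msd-aligned range (low, high].
    Hypothetical helper; name and signature are illustrative."""
    n = round((high - low) / msd)          # distinct values at the msd position
    if n == 7:
        widths = [2, 3, 2]                 # special 2-3-2 grouping
    elif n in (3, 6, 9):
        widths = [n / 3] * 3               # three equi-width intervals
    elif n in (2, 4, 8):
        widths = [n / 4] * 4               # four equi-width intervals
    elif n in (1, 5, 10):
        widths = [n / 5] * 5               # five equi-width intervals
    else:
        raise ValueError(f"3-4-5 rule does not cover n = {n}")
    intervals, left = [], low
    for w in widths:
        intervals.append((left, left + w * msd))
        left += w * msd
    return intervals

# Example: a range covering 3 distinct values at msd = 1,000,000
print(partition_3_4_5(-1_000_000, 2_000_000, 1_000_000))
# -> [(-1000000, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]
```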
Example of 3-4-5 rule
■ Suppose that profits at different branches of AllElectronics for the
year 1997 cover a wide range, from -$351,976.00 to
$4,700,896.50.
■ A user wishes to have a concept hierarchy for profit automatically
generated.
⚪ For improved readability, we use the notation (l…r] to represent the interval from l (exclusive) to r (inclusive). For example, (-$1,000,000…$0] denotes the range from -$1,000,000 (exclusive) to $0 (inclusive).
■ Suppose that the data within the 5%-tile and 95%-tile are
between -$159,876 and $1,838,761. The results of applying the
3-4-5 rule are shown in the next slide



Example of 3-4-5 rule (5 Steps)
[Figure: overview of the five steps of the 3-4-5 rule applied to the AllElectronics profit data. Step 1: Min = -$351,976, Low (5%-tile) = -$159,876, High (95%-tile) = $1,838,761, Max = $4,700,896.50. Step 2: msd = $1,000,000, LOW = -$1,000,000, HIGH = $2,000,000. Steps 3-5 build the interval hierarchy detailed on the following slides.]
Example of 3-4-5 rule
Step 1: Based on the information, the minimum and maximum values are: MIN
= -$351,976.00, and MAX = $4,700,896.50. The low (5%-tile) and
high (95%-tile) values to be considered for the top or first level of
segmentation are: LOW = -$159,876.00 and HIGH = $1,838,761.00
Step 2: Given LOW and HIGH, the most significant digit is at the million dollar
digit position (i.e., msd = 1,000,000). Rounding LOW down to the
million dollar digit, we get LOW = -$1,000,000 and rounding HIGH
up to the million dollar digit, we get HIGH = +$2,000,000.
Example of 3-4-5 rule

Step 3: Since this interval ranges over 3 distinct values at the most significant
digit, i.e., (2,000,000 - (-1,000,000)) / 1,000,000 = 3, the segment is
partitioned into 3 equi-width subsegments according to the 3-4-5 rule:
(-$1,000,000…$0], ($0…$1,000,000], and ($1,000,000…$2,000,000].
This represents the top tier of the hierarchy.

Step 3 (in $1,000s): (-$1,000…$2,000] → (-$1,000…$0], ($0…$1,000], ($1,000…$2,000]


Example of 3-4-5 rule
Step 4: We now examine the MIN and MAX values to see how they “fit” into the first level partitions.
• Since the first interval, (-$1,000,000…$0] covers the MIN value, i.e.,
LOW < MIN, we can adjust the left boundary of this interval to make
the interval smaller. The most significant digit (msd) of MIN = 100,000.
Rounding MIN down to this position, we get MIN = -$400,000.
Therefore, the first interval is redefined as (-$400,000…$0].
• Since the last interval, ($1,000,000…$2,000,000] does not cover the
MAX value, i.e., MAX > HIGH, we need to create a new interval to
cover it. Rounding up MAX at its most significant digit position, the new
interval is ($2,000,000…$5,000,000].
Hence, the top most level of the hierarchy contains four partitions,
(-$400,000…$0], ($0…$1,000,000], ($1,000,000…$2,000,000],
($2,000,000…$5,000,000]

Step 4 (in $1,000s): (-$400…$5,000] → (-$400…$0], ($0…$1,000], ($1,000…$2,000], ($2,000…$5,000]
Example of 3-4-5 rule
Step 5: Recursively, each interval can be further partitioned according to the 3-4-5 rule to form the next lower
level of the hierarchy:
• The first interval (-$400,000…$0] is partitioned into 4 sub-intervals: (-$400,000…-$300,000], (-$300,000…-$200,000], (-$200,000…-$100,000], and (-$100,000…$0].
• The second interval, ($0…$1,000,000], is partitioned into 5 sub-intervals: ($0…$200,000],
($200,000… $400,000], ($400,000…$600,000], ($600,000…$800,000], and ($800,000…$1,000,000].
• The third interval, ($1,000,000…$2,000,000], is partitioned into 5 sub-intervals: ($1,000,000…
$1,200,000], ($1,200,000…$1,400,000], ($1,400,000…$1,600,000], ($1,600,000…$1,800,000], and
($1,800,000…$2,000,000].
• The last interval, ($2,000,000…$5,000,000], is partitioned into 3 sub-intervals: ($2,000,000…
$3,000,000], ($3,000,000…$4,000,000], and ($4,000,000…$5,000,000].

Step 5 (in $1,000s):
(-$400…$0] → (-$400…-$300], (-$300…-$200], (-$200…-$100], (-$100…$0]
($0…$1,000] → ($0…$200], ($200…$400], ($400…$600], ($600…$800], ($800…$1,000]
($1,000…$2,000] → ($1,000…$1,200], ($1,200…$1,400], ($1,400…$1,600], ($1,600…$1,800], ($1,800…$2,000]
($2,000…$5,000] → ($2,000…$3,000], ($3,000…$4,000], ($4,000…$5,000]
Similarly, the 3-4-5 rule can be carried out iteratively at deeper levels, as necessary (the sketch below reproduces two of the step-5 partitions).
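
Continuing the hypothetical `partition_3_4_5` sketch from earlier, the same helper reproduces, for instance, the step-5 partitions of the first two intervals:

```python
# Step 5 via the sketch above, at msd = $100,000:
print(partition_3_4_5(-400_000, 0, 100_000))
# -> four sub-intervals: (-400000, -300000), ..., (-100000, 0)
print(partition_3_4_5(0, 1_000_000, 100_000))
# -> five sub-intervals of width 200,000: (0, 200000), ..., (800000, 1000000)
```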
