Data Preprocessing: Instructor Prof. Surabhi Thatte, Chapter 2 and Chapter 3 of Han-Kamber 3rd Edition

Data Preprocessing
INSTRUCTOR PROF. SURABHI THATTE

Chapter 2 and Chapter 3 of Han-Kamber 3rd edition
Data objects and attributes
❖Data sets are made up of data objects
❖Data object – entity, sample, example, instance, data point, tuple, row
❖Attribute – data field, dimension, feature, variable
❖Observation – observed value of an attribute
❖Attribute vector – Feature vector – a set of attributes used to describe a given object
4/8/21 COMPILED BY PROF. SURABHI THATTE, MITWPU 2

Types of attribute
❖Nominal = Categorical
❖Relating to names
❖Values are symbols or names of things
❖Each value represents a category, code, state
❖No meaningful order
❖Example:
1. Hair_color: black, brown, blond
2. Marital_status: single, married, divorced
3. Occupation: teacher, doctor, farmer
❖Can also be represented by numbers (1=red, 2=black)
❖No mathematical operations, no meaningful order, not quantitative
❖Possible to find mode – most commonly occurring value
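A minimal sketch of finding the mode of a nominal attribute. The observations below are made-up values for Hair_color, and the helper name `mode` is chosen only for illustration.

```python
from collections import Counter

# Hypothetical observations of the nominal attribute Hair_color.
hair_color = ["black", "brown", "black", "blond", "black", "brown"]

def mode(values):
    """Return the most frequently occurring value (on ties, the first seen wins)."""
    counts = Counter(values)
    return counts.most_common(1)[0][0]

print(mode(hair_color))  # black
```

Only counting is involved, which is why the mode is defined even though the values have no order and no arithmetic.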

Types of attribute
❖Binary Attributes
❖Nominal attribute with only 2 categories: 0 or 1
❖True/False, Present/Absent, Positive/Negative, Yes/No
❖Examples:
❖Diabetic: yes/no
❖Cancer: yes/no
❖Anomalous: true/false
❖Symmetric – If both states are equally valuable and carry the same weight
❖Asymmetric – If outcomes have different importance
❖Most important or rarest outcome is coded as 1
❖Example: Dengue positive: 1 , Dengue negative: 0

Types of attribute
❖Ordinal Attributes
❖The values have a meaningful order or ranking among them
❖Magnitude between successive values is not known
❖Example:
❖Customer_satisfaction: very satisfied, somewhat satisfied, neutral, dissatisfied
❖Size_of_beverage: small, medium, large
❖Professional_rank: assistant professor, associate professor, professor
❖Useful for registering subjective assessment of qualities
❖Mean cannot be defined, but median and mode can be defined
❖Qualitative attribute – actual quantity not given
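A small sketch of why the median is defined for an ordinal attribute: the values can be sorted by rank even though they cannot be averaged. The ordering list and the sample values are assumptions made for illustration.

```python
# Assumed ordering of the ordinal attribute Size_of_beverage, smallest first.
ORDER = ["small", "medium", "large"]

def ordinal_median(values, order=ORDER):
    """Median of an ordinal attribute: sort by rank and take the middle value.

    For an even number of values the lower middle value is returned, because
    averaging two categories is not defined for ordinal data.
    """
    rank = {label: i for i, label in enumerate(order)}
    ranked = sorted(values, key=rank.__getitem__)
    return ranked[(len(ranked) - 1) // 2]

sizes = ["large", "small", "medium", "small", "large"]
print(ordinal_median(sizes))  # medium
```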

Numeric Attributes
❖Interval-Scaled Attributes
❖Measured on the scale of equal-size units
❖Values have order and can be positive or negative
❖Difference between values can be compared and quantified
❖We cannot speak of values in terms of ratio
❖Mean, median, mode can be calculated
❖Example: Temperature, Date
❖Ratio-scaled Attributes
❖Numeric attribute with an inherent zero-point
❖Difference and ratio can be calculated
❖Mean, median, mode can be calculated
❖Example: years_of_experience, number_of_words, weight, height

Discrete versus Continuous
❑Discrete attribute – finite or countably infinite set of values
❑Examples: number_of_students, drink_size, customer_id, zipcode
❑Continuous attribute – real numbers, floating-point variables
❑Example: height

Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not, dangling, …
◦ Timeliness: timely update?
◦ Believability: how much the data are trusted to be correct?
◦ Interpretability: how easily the data can be understood?
◦ Refer to Han-Kamber for more details

Major Tasks in Data Preprocessing
Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
◦ Integration of multiple databases, data cubes, or files
◦ Resolving inconsistencies (customer_id vs cust_id)
Data reduction – reduced volume but same analysis result
◦ Dimensionality reduction – wavelet transform, PCA
◦ Numerosity reduction – log linear models, clusters
◦ Data compression
Data transformation and data discretization
◦ Normalization, discretization
◦ Concept hierarchy generation
*Above categorization is not mutually exclusive. Removal of redundant data is data cleaning as well as data reduction

Forms of Data Preprocessing
[Figure: forms of data preprocessing]

Data Cleaning

Data Cleaning
Data in the Real World is Dirty
Reason: faulty instruments, human or computer error, transmission error
◦ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
◦ e.g., Occupation = “ ” (missing data)
◦ Noisy: containing noise, errors, or outliers
◦ e.g., Salary = “−10” (an error)
◦ Inconsistent: containing discrepancies in codes or names, e.g.,
◦ Age = “42”, Birthday = “03/07/2010”
◦ Was rating “1, 2, 3”, now rating “A, B, C”
◦ Discrepancy between duplicate records
◦ Intentional (e.g., disguised missing data)
◦ Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data
▪ Data is not always available
▪ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
▪ Missing data may be due to
▪ equipment malfunction
▪ inconsistent with other recorded data and thus deleted
▪ data not entered due to misunderstanding / privacy issues
▪ certain data may not be considered important at the time of entry
▪ Missing data may need to be inferred
▪ Does missing value always imply error in the data? Justify.

How to Handle Missing Data?
❑ Ignore the tuple: usually done when class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
❑ Fill in the missing value manually: tedious + infeasible?
❑ Fill it in automatically with
◦ a global constant: e.g., “unknown”, a new class?!
◦ the attribute mean or median
◦ the attribute mean for all samples belonging to the same class: smarter
◦ the most probable value: inference-based such as Bayesian formula or decision tree, by considering other attributes
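The two mean-based strategies above can be sketched as follows. The tuples, class labels, and function names are hypothetical; `None` is assumed to mark a missing income.

```python
from statistics import mean

# Hypothetical tuples: (class label, income); None marks a missing value.
rows = [
    ("low_risk", 50_000), ("low_risk", None), ("low_risk", 70_000),
    ("high_risk", 20_000), ("high_risk", 30_000), ("high_risk", None),
]

def fill_with_mean(rows):
    """Replace each missing income with the mean of all observed incomes."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing income with the mean income of its own class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

print(fill_with_mean(rows)[1])        # ('low_risk', 42500)
print(fill_with_class_mean(rows)[1])  # ('low_risk', 60000)
```

The class-conditional fill gives 60000 for the low_risk tuple and 25000 for the high_risk tuple, whereas the global mean assigns 42500 to both, which illustrates why the slide calls the per-class mean "smarter".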

Noisy Data
❑ What is noise?
❑ Random error or variance in a measured variable
❑ How do we identify noise?
❑ Boxplots, scatter plots, other methods of data visualization
❑ Data Smoothing Techniques
❑ Binning
❑ Regression
❑ Outlier Analysis
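As an illustration of what a boxplot flags, here is a hedged sketch of the common 1.5 × IQR whisker rule. The rule and the factor `k=1.5` are a widely used convention assumed here, not something stated on the slide; the data values are made up.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical prices with one suspicious value appended.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120]
print(boxplot_outliers(prices))  # [120]
```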

Binning
➢ Binning methods smooth a sorted data value by consulting its neighborhood (local smoothing)
➢ Sorted values are distributed into a number of equal-frequency buckets (bins)
➢ Smoothing by bin means – each value of a bin is replaced by the mean value of the bin
➢ Smoothing by bin medians – each bin value is replaced by the bin median
➢ Smoothing by bin boundaries – minimum and maximum values in a given bin are identified as bin boundaries. Each bin value is replaced by the closest boundary value.
Smaller or larger bin width: which gives the greater smoothing effect?

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
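The worked example above can be reproduced with a short sketch. The function names are illustrative, the input length is assumed to be divisible by the number of bins, and ties in boundary smoothing are assumed to go to the lower boundary.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

The exact bin means are 9, 22.75 and 29.25; rounding gives the 9, 23 and 29 shown on the slide.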
Regression
➢ Technique that fits data values to a function
➢ Linear regression involves finding the best line to fit two attributes or variables so that one can be used to predict the other
➢ Example: years of experience to predict salary
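A minimal least-squares sketch of the years-of-experience example. The data points are invented (salary in thousands), and `fit_line` is an illustrative name; ordinary least squares is assumed as the meaning of "best line".

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years, salary)
predict = lambda x: slope * x + intercept
print(slope, intercept, predict(6))  # 5.0 25.0 55.0
```

For smoothing, each observed salary would be replaced by the value the fitted line predicts for its years of experience.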

Outlier analysis
❑ Outliers can be detected by clustering
❑ Outlier detection or anomaly detection is the process of finding data objects with behaviors that are very different from expectation
❑ Applications:
❑ Fraud detection, security, image processing, video analysis, intrusion detection

Discussion
Is concept hierarchy a form of data discretization?
Can it be used for data smoothing?

Tools for discrepancy detection
❑ Data scrubbing tools use simple domain knowledge (e.g. knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data
❑ Data auditing tools analyze data to discover rules and relationships and detect data that violate such conditions
❑ Potter’s Wheel is a publicly available data cleaning tool that does discrepancy detection and transformation

Data Integration

Data Integration
❑ Merging of data from multiple data stores
❑ Problems – redundancies and inconsistencies
❑ Challenges – matching schema and objects from different sources

Entity Identification Problem
❑ Problem of matching equivalent real world entities from multiple data sources
❑ How can a data analyst be sure that customer_id from one database and cust_number in another database refer to the same attribute?
❑ Metadata can help to avoid data integration issues
❑ Metadata for each attribute include name, meaning, data type, range of values permitted, null rules
❑ Functional dependencies and referential constraints should be taken care of during data integration

Handling Redundancy in Data Integration
❖ Redundant data occur often during integration of multiple databases
❖ Object identification: The same attribute or object may have different names in different databases
❖ Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
❖ Redundant attributes may be detected by the chi-square correlation test for nominal data, and by the correlation coefficient or covariance analysis for numeric data.
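For numeric attributes, the correlation coefficient check can be sketched as below. The attribute names and values are hypothetical; a coefficient close to +1 or -1 suggests that one attribute may be redundant given the other.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical attributes: annual revenue is derivable from monthly revenue.
monthly_revenue = [10, 20, 30, 40]
annual_revenue = [120, 240, 360, 480]
print(pearson_r(monthly_revenue, annual_revenue))
```

Here the coefficient is 1 (up to floating-point rounding), consistent with annual revenue being a derived attribute.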

Other Problems in Integration
❑ Tuple Duplication – redundancy at tuple level
❑ Denormalization is one cause of redundancy
❑ Data value conflict detection – ‘weight’ attribute may be stored in different measurement systems in different databases
❑ Currencies and tax calculation rules are different for different countries

Data Reduction

Strategies
❖ Discrete Wavelet Transforms (DWT)
❖ Principal Components Analysis (PCA)
❖ Attribute subset selection
❖ Clustering
❖ Sampling
❖ Data Cube Aggregation

Attribute Subset Selection
▪ In multi-dimensional data, some attributes may be irrelevant to the data mining task
▪ Example – If the task is to classify customers based on whether or not they are likely to purchase a popular new CD at the store
▪ Relevant attributes – age, music_taste
▪ Irrelevant attributes – telephone number
▪ Domain expert can pick out relevant attributes, but time-consuming
▪ Attribute subset selection (Feature subset selection in ML) reduces data set size by removing irrelevant attributes

Finding good subset
▪ For ‘n’ attributes, there are 2^n subsets
▪ Heuristic (greedy) methods are used for attribute subset selection
▪ These methods make a locally optimal choice, hoping that it will lead to a globally optimal solution
▪ Best attributes are decided by measures such as ‘information gain’
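A hedged sketch of scoring attributes by information gain, the measure named above. The tiny data set and the attribute `phone_parity` are invented; ranking attributes by gain and keeping the best ones is a simplified stand-in for a full greedy selection procedure.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Reduction in entropy of `target` after splitting rows on `attr`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical customers: age predicts the purchase, phone_parity does not.
rows = [
    {"age": "young", "phone_parity": "even", "buys_cd": "yes"},
    {"age": "young", "phone_parity": "odd", "buys_cd": "yes"},
    {"age": "old", "phone_parity": "even", "buys_cd": "no"},
    {"age": "old", "phone_parity": "odd", "buys_cd": "no"},
]

ranked = sorted(["age", "phone_parity"],
                key=lambda a: information_gain(rows, a, "buys_cd"),
                reverse=True)
print(ranked)  # ['age', 'phone_parity']
```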

Sampling
❖ Data reduction technique
❖ Allows a large dataset to be represented by a much smaller data sample
❖ Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
❖ Key principle: Choose a representative subset of the data

Types of Sampling
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Cluster sample
• If tuples in D are grouped into M disjoint clusters, then a simple random sample of s clusters (s < M) can be obtained
• Stratified sampling
• Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data
• Example – creating a stratum for each age group
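The sampling schemes above can be sketched with Python's standard `random` module. The function names, the seed, and the age-group data are assumptions for illustration; each stratum is sampled at the same fraction, with at least one tuple kept per stratum.

```python
import random

def srswor(data, s, rng):
    """Simple random sample of size s WITHOUT replacement."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample of size s WITH replacement."""
    return rng.choices(data, k=s)

def stratified_sample(rows, key, fraction, rng):
    """Draw roughly the same fraction from every stratum defined by key(row)."""
    strata = {}
    for r in rows:
        strata.setdefault(key(r), []).append(r)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep rare strata represented
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical skewed data: 90 adults and 10 seniors.
customers = [("adult", i) for i in range(90)] + [("senior", i) for i in range(10)]

without = srswor(customers, 10, rng)
with_repl = srswr(customers, 10, rng)
strat = stratified_sample(customers, key=lambda r: r[0], fraction=0.1, rng=rng)
print(len(without), len(with_repl), len(strat))  # 10 10 10
```

With the skewed data above, a plain random sample of 10 may contain no seniors at all, while the stratified sample always contains 9 adults and 1 senior.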

Sampling: with or without Replacement
[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]

Cluster or Stratified Sampling
[Figure: Raw Data and the resulting Cluster/Stratified Sample]

Data Transformation

Data Transformation
1. Data are transformed and consolidated so that the resulting mining process is efficient.
2. Strategies for data transformation
   1. Smoothing
   2. Attribute construction – attribute discovery
   3. Aggregation
   4. Normalization
   5. Discretization
   6. Concept hierarchy generation

Normalization
➢ Normalizing the data attempts to give all attributes an equal weight
➢ For distance based methods, normalization helps prevent attributes with initially large ranges (e.g. income) from outweighing attributes with smaller ranges (e.g. age)
➢ It removes dependence on measurement units
➢ Normalization involves transforming the data to fall within a smaller or common range such as [-1,1] or [0.0,1.0]
➢ Normalization is useful for algorithms like Neural Networks, or distance based algorithms like Nearest Neighbour classification as well as Clustering
➢ Methods:
➢ Min-max normalization
➢ Z-score normalization
➢ Decimal Scaling
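A brief sketch of z-score normalization and decimal scaling from the list above. The use of the population standard deviation (`pstdev`) is an assumption, since the slide does not say which standard deviation is meant; the sample values are made up.

```python
from statistics import mean, pstdev

def z_score(values):
    """Map each value v to (v - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # population std dev assumed
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer with max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(z_score([2, 4, 4, 4, 5, 5, 7, 9])[-1])  # 2.0 (mean 5, std dev 2)
print(decimal_scaling([-986, 917]))           # [-0.986, 0.917]
```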

Min-Max Normalization
• Let A be a numeric attribute (e.g. income) with n observed values
• Let min_A and max_A be the minimum and maximum values of A
• Min-Max Normalization maps a value v of A to v' in the range [new_min_A, new_max_A]
\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]
• Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
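The income example can be checked with a few lines of code. This is only a sketch; the function and parameter names are illustrative and mirror the symbols used in the formula.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
```

The endpoints map as expected: $12,000 goes to 0.0 and $98,000 goes to 1.0.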
Discretization
Discretization: Divide the range of a continuous attribute into intervals
◦ Interval labels can then be used to replace actual data values
◦ Reduce data size by discretization
◦ Supervised vs. unsupervised
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an attribute
◦ Prepare for further analysis, e.g., classification
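One simple unsupervised way to replace values by interval labels is equal-width binning, sketched below. Equal-width binning is chosen here only as an example technique; the function name and label format are assumptions, and the input is assumed to contain at least two distinct values.

```python
def equal_width_labels(values, n_bins):
    """Replace each value by the label of its equal-width interval.

    Assumes max(values) > min(values). The last interval is closed on the
    right so that the maximum value falls inside it.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum
        left = lo + idx * width
        right_bracket = "]" if idx == n_bins - 1 else ")"
        labels.append(f"[{left:g}, {left + width:g}{right_bracket}")
    return labels

print(equal_width_labels([4, 15, 24, 34], 3))
# ['[4, 14)', '[14, 24)', '[24, 34]', '[24, 34]']
```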

Extra

Data Wrangling
❖ Data Wrangling is the process of converting and mapping data from its raw form to another format with the purpose of making it more valuable and appropriate for advanced tasks such as Data Analytics and Machine Learning.
❖ Difference between Data Wrangling and ETL
❖ Users – Business Analysts vs IT employees
❖ Data – diverse, complex vs well structured
❖ Use Cases – Exploratory vs Reporting & Analysis
❖ Yet, Data Wrangling and ETL are complementary
