
Data Preprocessing

INSTRUCTOR: PROF. SURABHI THATTE


Chapter 2 and Chapter 3 of Han-Kamber 3rd edition
Data objects and attributes
❖Data sets are made up of data objects
❖Data object – entity, sample, example, instance, data point, tuple, row
❖Attribute – data field, dimension, feature, variable
❖Observation – observed value of an attribute
❖Attribute vector – Feature vector – a set of attributes used to describe a given object

4/8/21 COMPILED BY PROF. SURABHI THATTE, MITWPU 2


Types of attribute
❖Nominal = Categorical
❖Relating to names
❖Values are symbols or names of things
❖Each value represents a category, code, state
❖No meaningful order
❖Example:
1. Hair_color: black, brown, blond
2. Marital_status: single, married, divorced
3. Occupation: teacher, doctor, farmer
❖Can also be represented by numbers (1=red, 2=black)
❖No mathematical operations, no meaningful order, not quantitative
❖Possible to find mode – most commonly occurring value
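
The mode is the only measure of central tendency that makes sense for a nominal attribute. A minimal sketch, assuming a small hypothetical sample of Hair_color values (the sample and the helper name `mode_of` are illustrative, not from the slides):

```python
from collections import Counter

# Hypothetical sample of a nominal attribute (values are labels, not quantities).
hair_color = ["black", "brown", "black", "blond", "brown", "black"]

def mode_of(values):
    """Return the most frequently occurring value of a nominal attribute."""
    counts = Counter(values)
    value, _count = counts.most_common(1)[0]
    return value

print(mode_of(hair_color))  # black
```

Even if the categories were coded as numbers (1=red, 2=black), averaging those codes would be meaningless; counting occurrences still works.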



Types of attribute
❖Binary Attributes
❖Nominal attribute with only 2 categories: 0 or 1
❖True/False, Present/Absent, Positive/Negative, Yes/No
❖Examples:
❖Diabetic: yes/no
❖Cancer: yes/no
❖Anomalous: true/false
❖Symmetric – If both states are equally valuable and carry the same weight
❖Asymmetric – If outcomes have different importance
❖Most important or rarest outcome is coded as 1
❖Example: Dengue positive: 1 , Dengue negative: 0



Types of attribute
❖Ordinal Attributes
❖The values have a meaningful order or ranking among them
❖Magnitude between successive values is not known
❖Example:
❖Customer_satisfaction: very satisfied, somewhat satisfied, neutral, dissatisfied
❖Size_of_beverage: small, medium, large
❖Professional_rank: assistant professor, associate professor, professor
❖Useful for registering subjective assessment of qualities
❖Mean cannot be defined, but median and mode can be defined
❖Qualitative attribute – actual quantity not given
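
Because ordinal values can be ranked, a median can be read off the sorted values even though a mean is undefined. A minimal sketch, assuming the ranking small < medium < large for Size_of_beverage and a hypothetical sample (the helper `ordinal_median` is illustrative):

```python
# Assumed ranking for the ordinal attribute Size_of_beverage (order only, no distances).
ORDER = ["small", "medium", "large"]

def ordinal_median(values, order=ORDER):
    """Median of an ordinal attribute: sort by rank and take the middle value.

    For an even count the lower of the two middle values is returned,
    because averaging ranks would assume equal spacing between categories.
    """
    rank = {label: i for i, label in enumerate(order)}
    ranked = sorted(values, key=rank.__getitem__)
    return ranked[(len(ranked) - 1) // 2]

sizes = ["large", "small", "medium", "medium", "large"]  # hypothetical sample
print(ordinal_median(sizes))  # medium
```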



Numeric Attributes
❖Interval-Scaled Attributes
❖Measured on the scale of equal-size units
❖Values have order and can be positive or negative
❖Difference between values can be compared and quantified
❖We cannot speak of values in terms of ratio, because there is no true zero-point
❖Mean, median, mode can be calculated
❖Example: Temperature (in °C or °F), calendar Date
❖Ratio-scaled Attributes
❖Numeric attribute with an inherent zero-point
❖Difference and ratio can be calculated
❖Mean, median, mode can be calculated
❖Example: years_of_experience, number_of_words, weight, height



Discrete versus Continuous
❑Discrete attribute – finite or countably infinite set of values
❑Examples: number_of_students, drink_size, customer_id, zipcode
❑Continuous attribute – real numbers, floating-point variables
❑Example: height



Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not, dangling, …
◦ Timeliness: timely update?
◦ Believability: how much are the data trusted to be correct?
◦ Interpretability: how easily can the data be understood?
◦ Refer to Han-Kamber for more details



Major Tasks in Data Preprocessing
Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
◦ Integration of multiple databases, data cubes, or files
◦ Resolving inconsistencies (customer_id vs cust_id)
Data reduction – reduced volume but the same (or almost the same) analysis result
◦ Dimensionality reduction – wavelet transform, PCA
◦ Numerosity reduction – log linear models, clusters
◦ Data compression
Data transformation and data discretization
◦ Normalization, discretization
◦ Concept hierarchy generation
*The above categorization is not mutually exclusive. Removal of redundant data is data cleaning as well as data reduction



Forms of Data Preprocessing



Data Cleaning



Data Cleaning
Data in the Real World is Dirty
Reasons: faulty instruments, human or computer error, transmission error
◦ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
◦ e.g., Occupation = “ ” (missing data)
◦ Noisy: containing noise, errors, or outliers
◦ e.g., Salary = “−10” (an error)
◦ Inconsistent: containing discrepancies in codes or names, e.g.,
◦ Age = “42”, Birthday = “03/07/2010”
◦ Was rating “1, 2, 3”, now rating “A, B, C”
◦ Discrepancy between duplicate records
◦ Intentional (e.g., disguised missing data)
◦ Jan. 1 as everyone’s birthday?



Incomplete (Missing) Data
▪ Data is not always available
▪ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
▪ Missing data may be due to
▪ equipment malfunction
▪ inconsistent with other recorded data and thus deleted
▪ data not entered due to misunderstanding / privacy issues
▪ certain data may not be considered important at the time of entry
▪ Missing data may need to be inferred
▪ Does missing value always imply error in the data? Justify.



How to Handle Missing Data?
❑ Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
❑ Fill in the missing value manually: tedious and often infeasible
❑ Fill it automatically with
◦ a global constant: e.g., “unknown” (which may end up acting as a new class)
◦ the attribute mean or median
◦ the attribute mean for all samples belonging to the same class: smarter
◦ the most probable value: inference-based, such as a Bayesian formula or decision tree, by considering other attributes
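
Two of the automatic strategies above, the attribute mean and the class-wise attribute mean, can be sketched as follows. This is an illustrative sketch only: the tuples, the use of `None` as the missing-value marker, and the function names are assumptions, not part of the slides.

```python
from statistics import mean

# Hypothetical tuples: (class_label, income); None marks a missing value.
rows = [("low", 20.0), ("low", None), ("low", 30.0),
        ("high", 80.0), ("high", 100.0), ("high", None)]

def fill_with_global_mean(rows):
    """Replace each missing value with the mean of all observed values."""
    observed = [v for _, v in rows if v is not None]
    m = mean(observed)
    return [(c, m if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of its own class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

print(fill_with_global_mean(rows)[1])  # ('low', 57.5)
print(fill_with_class_mean(rows)[1])   # ('low', 25.0)
```

On this sample the global mean (57.5) sits between the two classes and fits neither, which is why the class-wise mean is described as the smarter choice.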



Noisy Data
❑ What is noise?
❑ Random error or variance in a measured variable
❑ How do we identify noise?
❑ Boxplots, scatter plots, other methods of data visualization
❑ Data Smoothing Techniques
❑ Binning
❑ Regression
❑ Outlier Analysis



Binning
➢ Binning methods smooth a sorted data value by consulting its neighborhood (local smoothing)
➢ Sorted values are distributed into a number of equal-frequency buckets (bins)
➢ Smoothing by bin means – each value in a bin is replaced by the mean value of the bin
➢ Smoothing by bin medians – each bin value is replaced by the bin median
➢ Smoothing by bin boundaries – minimum and maximum values in a given bin are identified as bin boundaries. Each bin value is replaced by the closest boundary value.
Question: does a smaller or a larger bin width give a greater smoothing effect?



Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
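
The worked example above can be reproduced with a short sketch. The function names are illustrative, and the partitioning assumes the number of values divides evenly into the number of bins, as it does here.

```python
from statistics import mean

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size (assumes an exact split)."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max (ties go to the min)."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # bin means are 9, 22.75 and 29.25
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The exact bin means are 9, 22.75 and 29.25; the slide shows them rounded to 9, 23 and 29.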

Regression
➢ Technique that smooths data by fitting the data values to a function
➢ Linear regression involves finding the best line to fit two attributes or variables so that one can be used to predict the other
➢ Example: years of experience used to predict salary
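
A minimal sketch of smoothing by linear regression, using an ordinary least-squares fit written out by hand. The years/salary figures are hypothetical values invented for illustration, and `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience and salary (in thousands).
years = [1, 3, 5, 7, 9]
salary = [32, 41, 48, 61, 68]

slope, intercept = fit_line(years, salary)
# Smoothed salaries: each noisy value is replaced by the point on the fitted line.
smoothed = [slope * x + intercept for x in years]
print(slope, intercept)
```

The same fitted line can also predict salary for a number of years that does not appear in the data.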



Outlier analysis
❑ Outliers can be detected by clustering
❑ Outlier detection or anomaly detection is the process of finding data objects with behaviors that are very different from expectation
❑ Applications:
❑ Fraud detection, security, image processing, video analysis, intrusion detection
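
The slide names clustering as the detection method. As a simpler stand-in, the sketch below uses the interquartile-range rule that underlies boxplots; it is a different technique from clustering, shown only to make "very different from expectation" concrete. The data and the factor `k=1.5` are assumptions.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot fences."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical transaction amounts with one suspicious value.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(amounts))  # [95]
```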



Discussion
Is concept hierarchy a form of data discretization?
Can it be used for data smoothing?


Tools for discrepancy detection
❑ Data scrubbing tools use simple domain knowledge (e.g. knowledge of postal address and spell-check) to detect errors and make corrections in the data
❑ Data auditing tools analyze data to discover rules and relationships and detect data that violate such conditions
❑ Potter’s Wheel is a publicly available data cleaning tool that does discrepancy detection and transformation



Data Integration



Data Integration
❑ Merging of data from multiple data stores
❑ Problems – redundancies and inconsistencies
❑ Challenges – matching schema and objects from different sources



Entity Identification Problem
❑ Problem of matching equivalent real world entities from multiple data sources
❑ How can a data analyst be sure that customer_id from one database and cust_number in another database refer to the same attribute?
❑ Metadata can help to avoid data integration issues
❑ Metadata for each attribute include name, meaning, data type, range of values permitted, null rules
❑ Functional dependencies and referential constraints should be taken care of during data integration



Handling Redundancy in Data Integration
❖ Redundant data occur often during integration of multiple databases
❖ Object identification: The same attribute or object may have different names in different databases
❖ Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
❖ Redundant attributes can be detected by the chi-square correlation test for nominal data, and by the correlation coefficient or covariance analysis for numeric data.
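
Both tests named above can be sketched in a few lines. The sample values are hypothetical: a monthly revenue column against an annual revenue derived from it, and an illustrative 2x2 table of observed counts for two nominal attributes. Function names are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def chi_square(table):
    """Pearson chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical derived attribute: annual revenue = 12 * monthly revenue.
monthly = [10.0, 12.5, 9.0, 20.0]
annual = [12 * m for m in monthly]
print(pearson_r(monthly, annual))  # very close to 1.0, so one attribute is redundant

# Hypothetical observed counts for two nominal attributes.
counts = [[250, 200],
          [50, 1000]]
print(round(chi_square(counts), 2))  # 507.94
```

A coefficient near +1 or -1, or a chi-square statistic well above the critical value for the table's degrees of freedom, suggests that one attribute can be dropped.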

Other Problems in Integration
❑ Tuple Duplication – redundancy at tuple level
❑ Denormalization is one cause of redundancy
❑ Data value conflict detection – ‘weight’ attribute may be stored in different measurement systems in different databases
❑ Currencies and tax calculation rules are different for different countries



Data Reduction



Strategies
❖ Discrete Wavelet Transforms (DWT)
❖ Principal Components Analysis (PCA)
❖ Attribute subset selection
❖ Clustering
❖ Sampling
❖ Data Cube Aggregation



Attribute Subset Selection
▪ In multi-dimensional data, some attributes may be irrelevant to the data mining task
▪ Example – If the task is to classify customers based on whether or not they are likely to purchase a popular new CD at the store
▪ Relevant attributes – age, music_taste
▪ Irrelevant attributes – telephone number
▪ Domain expert can pick out relevant attributes, but this is time-consuming
▪ Attribute subset selection (Feature subset selection in ML) reduces data set size by removing irrelevant attributes



Finding good subset
▪ For ‘n’ attributes, there are 2^n subsets, so exhaustive search is impractical
▪ Heuristic (greedy) methods are used for attribute subset selection
▪ These methods make a locally optimal choice, hoping that it will lead to a globally optimal solution
▪ Best attributes are decided by measures such as ‘information gain’
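
A rough sketch of how information gain can rank attributes, using the CD-purchase example. The six customer records are invented for illustration, and ranking every attribute once is a simplification of true stepwise (greedy) selection, which re-evaluates the remaining attributes after each pick.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

def rank_attributes(rows, attributes, target):
    """Order attributes from most to least informative."""
    return sorted(attributes,
                  key=lambda a: information_gain(rows, a, target),
                  reverse=True)

# Hypothetical customers; phone_parity stands in for the irrelevant telephone number.
rows = [
    {"age": "young", "music_taste": "pop",       "phone_parity": "even", "buys": "yes"},
    {"age": "young", "music_taste": "pop",       "phone_parity": "odd",  "buys": "yes"},
    {"age": "old",   "music_taste": "classical", "phone_parity": "even", "buys": "no"},
    {"age": "old",   "music_taste": "pop",       "phone_parity": "even", "buys": "yes"},
    {"age": "young", "music_taste": "classical", "phone_parity": "odd",  "buys": "no"},
    {"age": "old",   "music_taste": "classical", "phone_parity": "even", "buys": "no"},
]

print(rank_attributes(rows, ["age", "music_taste", "phone_parity"], "buys"))
# ['music_taste', 'age', 'phone_parity']
```

Keeping only the top-ranked attributes gives a reduced attribute set without examining all 2^n subsets.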



Sampling
❖ Data reduction technique
❖ Allows a large dataset to be represented by a much smaller data sample
❖ Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
❖ Key principle: Choose a representative subset of the data



Types of Sampling
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Cluster sample
• If tuples in D are grouped into M disjoint clusters, then a simple random sample of s clusters (s < M) can be obtained
• Stratified sampling
• Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data
• Example – creating a stratum for each age group
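
The sampling schemes above can be sketched with Python's standard `random` module. The customer data, the 10% fraction, and the function names are assumptions made for illustration; the "at least one object per stratum" rule is also a design choice of this sketch.

```python
import random

def srswor(population, s, rng):
    """Simple random sample without replacement: no object can be drawn twice."""
    return rng.sample(population, s)

def srswr(population, s, rng):
    """Simple random sample with replacement: an object may be drawn again."""
    return [rng.choice(population) for _ in range(s)]

def stratified_sample(population, stratum_of, fraction, rng):
    """Draw about the same fraction from every stratum (at least one per stratum)."""
    strata = {}
    for obj in population:
        strata.setdefault(stratum_of(obj), []).append(obj)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed data: (customer_id, age_group), 90 youths and 10 seniors.
customers = [(i, "youth") for i in range(90)] + [(i, "senior") for i in range(90, 100)]

sample = stratified_sample(customers, lambda c: c[1], 0.1, rng)
print(len(sample))  # 10
```

With skewed data like this, a plain random sample of 10 could easily miss the seniors altogether, whereas the stratified sample always contains 9 youths and 1 senior.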



Sampling: with or without Replacement

[Figure: Raw Data sampled two ways: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]



Cluster or Stratified Sampling
[Figure: Raw Data and the resulting Cluster/Stratified Sample]



Data Transformation



Data Transformation
1. Data are transformed and consolidated so that the resulting mining process is efficient.
2. Strategies for data transformation
   1. Smoothing
   2. Attribute construction – attribute discovery
   3. Aggregation
   4. Normalization
   5. Discretization
   6. Concept hierarchy generation



Normalization
➢ Normalizing the data attempts to give all attributes an equal weight
➢ For distance based methods, normalization helps prevent attributes with initially large ranges (e.g. income) from outweighing attributes with smaller ranges (e.g. age)
➢ It removes dependence on measurement units
➢ Normalization involves transforming the data to fall within a smaller or common range such as [-1,1] or [0.0,1.0]
➢ Normalization is useful for algorithms like Neural Networks, or distance based algorithms like Nearest Neighbour classification as well as Clustering
➢ Methods:
➢ Min-max normalization
➢ Z-score normalization
➢ Decimal Scaling
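
The slides list z-score normalization and decimal scaling without formulas. The sketch below uses the standard textbook definitions: z-score as v' = (v - mean) / standard deviation, and decimal scaling as v' = v / 10^j for the smallest integer j that makes every |v'| < 1. Using the population standard deviation is an assumption of this sketch, as are the sample values.

```python
from statistics import mean, pstdev

def z_score(values):
    """v' = (v - mean) / standard deviation (population standard deviation assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """v' = v / 10**j, with j the smallest integer such that max(|v'|) < 1."""
    j = 0
    largest = max(abs(v) for v in values)
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(decimal_scaling([-986, 917]))       # [-0.986, 0.917]
```

Z-score normalization is a reasonable choice when the true minimum and maximum are unknown or when outliers would dominate a min-max mapping.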



Min-Max Normalization
• Let A be a numeric attribute (e.g. income) with n observed values
• Let min_A and max_A be the minimum and maximum values of A
• Min-Max Normalization maps a value v of A to v' in the range [new_min_A, new_max_A]

\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]

• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]
• Then $73,600 is mapped to

\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
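
The income example can be checked with a direct transcription of the min-max formula. The function and parameter names are illustrative.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
```

A value outside the original [min_A, max_A] range, such as a later income above $98,000, would map outside [0.0, 1.0], which is a known limitation of this method.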

Discretization
Discretization: Divide the range of a continuous attribute into intervals
◦ Interval labels can then be used to replace actual data values
◦ Reduce data size by discretization
◦ Supervised vs. unsupervised
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an attribute
◦ Prepare for further analysis, e.g., classification
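
One simple unsupervised form of discretization is equal-width binning, where each value is replaced by the label of the interval it falls into. The sketch below is one possible illustration; the ages are hypothetical, and it assumes the attribute has at least two distinct values so the interval width is non-zero.

```python
def equal_width_labels(values, n_bins):
    """Map each value to an interval label using n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    labels = []
    for v in values:
        # The maximum value belongs to the last interval rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        left = lo + idx * width
        close = "]" if idx == n_bins - 1 else ")"
        labels.append(f"[{left:g}, {left + width:g}{close}")
    return labels

ages = [13, 33, 73]  # hypothetical values of a continuous attribute
print(equal_width_labels(ages, 3))  # ['[13, 33)', '[33, 53)', '[53, 73]']
```

Replacing many distinct values with a handful of interval labels is what reduces the data size.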



Extra



Data Wrangling
❖ Data Wrangling is the process of converting and mapping data from its raw form to another format with the purpose of making it more valuable and appropriate for advanced tasks such as Data Analytics and Machine Learning.
❖ Difference between Data Wrangling and ETL
❖ Users – Business Analysts vs IT employees
❖ Data – diverse, complex vs well structured
❖ Use Cases – Exploratory vs Reporting & Analysis
❖ Yet, Data Wrangling and ETL are complementary
