Data Preprocessing
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Summary
Data Preprocessing: Why Preprocess the Data?
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration
Integration of multiple databases or files.
Data reduction
Dimensionality reduction.
Numerosity reduction.
Data transformation
Normalization.
Concept hierarchy generation.
Discretization.
Data in the Real World is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error.
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data.
◼ e.g., Occupation=“ ” (missing data)
Noisy: containing noise, errors, or outliers.
◼ e.g., Salary=“−10” (an error)
Inconsistent: containing contradictions in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”.
Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Values
1. Ignore the tuple: usually done when the class label is missing (in classification); not effective when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value:
Manually: time-consuming and infeasible for large data sets with many missing values.
Use a global constant (such as a label like “Unknown”, −∞, or “NA”).
Use the central tendency of the attribute (e.g., the mean or median).
Use the attribute mean/median of all samples belonging to the same class.
Use the most probable value.
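These fill-in strategies map directly onto pandas one-liners. A minimal sketch, assuming a hypothetical DataFrame with an "income" attribute and a "class" label:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "income" has missing values, "class" is the class label.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 42_000, np.nan, 38_000],
})

# 1. Ignore the tuple: drop rows whose "income" is missing.
dropped = df.dropna(subset=["income"])

# 2a. Fill with a global constant (here -inf, one of the options above).
const_filled = df["income"].fillna(float("-inf"))

# 2b. Fill with the central tendency of the attribute (mean or median).
median_filled = df["income"].fillna(df["income"].median())

# 2c. Fill with the mean of all samples belonging to the same class.
class_filled = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(class_filled.tolist())
```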
Noisy Data
Binning
First, sort the data and partition it into (equal-frequency) bins.
Then smooth by bin means, bin medians, bin boundaries, etc.
Regression
Smooth by fitting the data to regression functions.
Outlier Analysis
Detect and remove outliers using clustering (see the sketch below).
◼ Values that fall outside of the set of clusters may be considered outliers.
Combined computer and human inspection.
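One way to realize cluster-based outlier detection is density-based clustering: scikit-learn's DBSCAN labels points that fall in no cluster as noise (label −1). A minimal sketch with hypothetical one-dimensional values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical attribute values; 95 is far from both clusters.
values = np.array([4, 5, 6, 7, 24, 25, 26, 95]).reshape(-1, 1)

# DBSCAN marks points that belong to no dense cluster with label -1.
labels = DBSCAN(eps=3, min_samples=2).fit_predict(values)
outliers = values[labels == -1].ravel()
print(outliers)  # [95]
```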
Example: Binning to Handle Noise
Equal-width binning divides the range into N intervals of equal size:
Width = (max − min) / #bins
Example: Width = (34 − 4) / 3 = 10, giving three bins: 4–13, 14–23, and 24–34.
These bins can then be used for data smoothing (by bin means, medians, or boundaries).
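A sketch of equal-width binning and smoothing by bin means in pandas. The slide gives only the range (min = 4, max = 34) and the three bins; the nine values below are illustrative:

```python
import pandas as pd

# Illustrative sorted values with min = 4 and max = 34 (the full data
# list is not shown on the slide; only the range and bin width are).
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width binning: width = (max - min) / #bins = (34 - 4) / 3 = 10.
bins = pd.cut(prices, bins=3)

# Smoothing by bin means: replace each value with its bin's mean.
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```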
Data Integration
Data integration:
The merging of data from multiple sources into a coherent store.
Challenges:
Entity identification problem: how to match schemas and objects from different sources?
Redundancy and correlation analysis: are any attributes correlated?
Entity Identification Problem
How can equivalent real-world entities from multiple data sources be matched up?
The same attribute or object may have different names in different databases.
Attribute names: e.g., A.cust-id vs. B.cust-#.
Attribute values: e.g., Bill Clinton vs. William Clinton.
If both are kept → redundancy.
Schema integration and object matching are tricky:
Integrate metadata from different sources.
Match equivalent real-world entities from multiple sources.
Detect and resolve data value conflicts.
◼ Possible reasons: different representations, different scales.
Metadata can be used to avoid errors in schema integration (see the sketch below).
◼ Examples of metadata for attributes: name, meaning, data type, and range of permitted values.
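As a rough first pass at the entity identification problem, string similarity over normalized attribute names can propose candidate matches, which metadata (type, permitted range) should then confirm. A sketch using Python's standard difflib, with hypothetical schemas:

```python
from difflib import SequenceMatcher

# Hypothetical attribute names from two sources.
schema_a = ["cust_id", "cust_name", "order_date"]
schema_b = ["cust-#", "customer_name", "date_of_order"]

def similarity(x: str, y: str) -> float:
    # Normalize separators and symbols before comparing.
    norm = lambda s: s.lower().replace("-", "_").replace("#", "id")
    return SequenceMatcher(None, norm(x), norm(y)).ratio()

# Propose the best candidate for each attribute; a human (or metadata
# check) should confirm before merging.
for a in schema_a:
    best = max(schema_b, key=lambda b: similarity(a, b))
    print(a, "->", best, round(similarity(a, best), 2))
```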
Attribute Redundancy and Correlation Analysis
Redundancy measures by data type:
Nominal data: Χ² (chi-square) test.
Numeric data: correlation coefficient, covariance.
Correlation Analysis (Nominal Data): Chi-Square
Χ² (chi-square) test:
Observed value is the actual count; expected value is the expected frequency.
Χ² = Σ (Observed − Expected)² / Expected
Expected frequencies are calculated as:
e_ij = (count(A = a_i) × count(B = b_j)) / n
The larger the Χ² value, the more likely the attributes are correlated.
The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count.
Dataset: a 2×2 contingency table of gender (male/female) versus preferred reading (fiction/non-fiction).
Degrees of freedom: df = (r − 1)(c − 1), where r is the number of rows and c is the number of columns.
df = (2 − 1)(2 − 1) = 1
From the Χ²-distribution table, the rejection threshold at the 0.001 significance level with df = 1 is 10.828. Since the computed Χ² exceeds this value, “preferred reading” and “gender” are strongly correlated in the given group of people.
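The original contingency table did not survive extraction, so the counts below are hypothetical; the sketch shows how scipy.stats.chi2_contingency carries out this df = 1 test (correction=False matches the plain Χ² formula above):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts: rows = gender (male, female),
# columns = preferred reading (fiction, non-fiction).
observed = np.array([[250, 50],
                     [200, 1000]])

chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.3g}")

# If chi2 exceeds the 0.001-level threshold of 10.828 (df = 1),
# gender and preferred reading are correlated in this sample.
print("expected counts:\n", expected)
```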
Correlation Analysis (Numeric Data): Correlation Coefficient
Correlation coefficient (Pearson):
r_A,B = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n σ_A σ_B) = (Σᵢ aᵢbᵢ − n Ā B̄) / (n σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ aᵢbᵢ is the sum of the AB cross-product.
r_A,B = 0: A and B are independent (no linear correlation).
Covariance:
Cov(A, B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n = (Σᵢ aᵢbᵢ) / n − Ā·B̄
where n is the number of tuples, and Ā and B̄ are the respective means (expected values) of A and B.
Covariance is closely related to the correlation coefficient:
r_A,B = Cov(A, B) / (σ_A σ_B)
Positive covariance: if A and B both tend to be larger than their expected values, then Cov(A, B) > 0 → they rise together.
Negative covariance: if A tends to be larger than its expected value while B is smaller than its expected value, then Cov(A, B) < 0.
Independence: if A and B are independent, then Cov(A, B) = 0 (the converse does not hold in general).
Example: stock prices for AllElectronics (6, 5, 4, 3, 2) and HighTech (20, 10, 14, 5, 5) at five time points.
(1) E(AllElectronics) = (6 + 5 + 4 + 3 + 2) / 5 = 4
(2) E(HighTech) = (20 + 10 + 14 + 5 + 5) / 5 = 10.80
(3) Cov(AllElectronics, HighTech) = (6 × 20 + 5 × 10 + 4 × 14 + 3 × 5 + 2 × 5) / 5 − 4 × 10.80 = 50.2 − 43.2 = 7
Since the covariance is positive, the two stock prices tend to rise together.
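The worked example can be checked with numpy; note that np.cov defaults to the sample (n − 1) covariance, so ddof=0 is needed to match the population formula used on the slide:

```python
import numpy as np

all_electronics = np.array([6, 5, 4, 3, 2])
high_tech = np.array([20, 10, 14, 5, 5])

# Population covariance (divide by n), matching the slide: 7.0
cov = np.cov(all_electronics, high_tech, ddof=0)[0, 1]

# Pearson correlation coefficient r = Cov(A,B) / (sigma_A * sigma_B)
r = np.corrcoef(all_electronics, high_tech)[0, 1]
print(cov, r)
```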
Data Transformation
Discretization
Encoding
Aggregation
Normalization
Data Transformation: Strategies
Discretization: raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior), as sketched below.
The labels can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
Concept hierarchy generation for nominal data: replacing low-level concepts with higher-level concepts.
e.g., attributes such as street can be generalized to higher-level concepts, like city or country.
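A sketch of both labeling styles with pandas.cut; the ages and the youth/adult/senior boundaries below are hypothetical:

```python
import pandas as pd

ages = pd.Series([6, 15, 23, 37, 48, 70])  # hypothetical raw values

# Interval labels (ranges in the style of the slide's 0-10, 11-20, ...).
intervals = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50, 80])

# Conceptual labels forming the bottom level of a concept hierarchy
# (the cut-offs 17 and 64 are assumptions for illustration).
concepts = pd.cut(ages, bins=[0, 17, 64, 120],
                  labels=["youth", "adult", "senior"])
print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```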
Data Transformation by Normalization
Z-score normalization:
v' = (v − μ_A) / σ_A
Example: let μ = 54,000 and σ = 16,000. Then v = 73,600 normalizes to (73,600 − 54,000) / 16,000 = 1.225.
Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
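A sketch of both normalizations in numpy, reusing the slide's z-score example; the decimal-scaling values are hypothetical:

```python
import numpy as np

# Z-score normalization: v' = (v - mu) / sigma.
mu, sigma = 54_000, 16_000   # the slide's example parameters
v = 73_600
print((v - mu) / sigma)      # 1.225

# Decimal scaling: v' = v / 10**j, with j the smallest integer such
# that max(|v'|) < 1 (the +1 guards against exact powers of 10).
values = np.array([-986, 217, 305])  # hypothetical attribute values
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
print(values / 10**j)        # [-0.986  0.217  0.305]
```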
Summary