
Foundations of

Data Science
Unit 4
Acknowledgement
▪ Most of the slides in this presentation are taken from material provided by
▪ Han and Kamber (Data Mining: Concepts and Techniques) and
▪ Tan, Steinbach and Kumar (Introduction to Data Mining)

Zarmeen Nasim, Spring 2021
Data Preprocessing
Data Preprocessing
▪ Data preprocessing is important because of the errors associated with the data collection process.
▪ Methods are needed to remove or correct missing and erroneous entries from the
data.
▪ There are several important aspects of data preprocessing:
▪ Handling Missing Values
▪ Scaling and Normalization
▪ Data Transformation
▪ Data Reduction

Missing Data
▪ Data is not always available
▪ E.g., many tuples have no recorded value for several attributes, such as user location in Twitter data

▪ Missing data may be due to


▪ equipment malfunction
▪ inconsistent with other recorded data and thus deleted
▪ data not entered due to misunderstanding
▪ certain data may not be considered important at the time of entry
▪ history or changes of the data were not recorded

▪ Missing data may need to be inferred


Handling Missing Data
▪ Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute
varies considerably

▪ Fill in the missing value manually: tedious + infeasible?

▪ Fill in it automatically with


▪ a global constant: e.g., “unknown”, a new class?!
▪ the attribute mean
▪ the attribute mean for all samples belonging to the same class
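These fill-in strategies can be sketched in Python (a toy example; the class labels and values are invented for illustration):

```python
from statistics import mean

# Toy records: (class_label, value); None marks a missing entry
data = [("A", 30.0), ("A", None), ("B", 50.0), ("B", 70.0)]

# Fill with a global constant
filled_const = [v if v is not None else "unknown" for _, v in data]

# Fill with the overall attribute mean
global_mean = mean(v for _, v in data if v is not None)
filled_mean = [v if v is not None else global_mean for _, v in data]

# Fill with the mean of the record's own class
class_means = {
    label: mean(v for c, v in data if c == label and v is not None)
    for label in {c for c, _ in data}
}
filled_class = [v if v is not None else class_means[c] for c, v in data]
```

Note how the per-class mean fills the missing "A" value with 30.0 (the mean of its own class) rather than 50.0 (the overall mean).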

Handling Missing Data
▪ The missing values may be estimated or imputed. However, errors created by the
imputation process may affect the results of the data mining algorithm.
▪ Alternatively, the analytical phase can be designed to work with missing values; many data mining methods are inherently robust to them.

Scaling and Normalization
▪ Some data mining methods, typically those that are based on
distance computation between points in an n-dimensional space,
may need normalized data for best results.
▪ If the values are not normalized, the distance measures will
overweight those features that have, on average, larger values
▪ Normalization Techniques:
▪ Decimal Scaling
▪ Min-Max Normalization
▪ Standard Deviation Normalization

Scaling and Normalization
Decimal Scaling
v’(i) = v(i) / 10^k
for the smallest k such that max |v’(i)| < 1.
▪ Let the input data be: -10, 201, 301, -401, 501, 601, 701
▪ To normalize the above data:
Step 1: Find the maximum absolute value in the data: 701
Step 2: Divide the data by 1000 (i.e., k = 3)
▪ Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
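The steps above can be sketched as a small helper (the function name is my own):

```python
def decimal_scale(values):
    """Divide by the smallest power of 10 that makes max |v'(i)| < 1."""
    k = 0
    while max(abs(v) for v in values) / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values]

# The slide's example: max |v| = 701, so k = 3
normalized = decimal_scale([-10, 201, 301, -401, 501, 601, 701])
```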

Scaling and Normalization
▪ Min-Max Normalization
v’(i) = [v(i) – min(v)]/[max(v) – min(v)]
▪ Ex. Let income ranging from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 – 12,000)/(98,000 – 12,000) = 0.716

▪ Standard Deviation Normalization


▪ v’(i) = [v(i) – mean(v)]/sd(v)
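Both formulas can be sketched directly (the slides do not specify whether sd is the sample or population standard deviation; the sample variant is assumed here):

```python
from statistics import mean, stdev  # stdev = sample standard deviation

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Income example from the slide: $73,600 in the range [$12,000, $98,000]
incomes = [12_000, 73_600, 98_000]
scaled = min_max(incomes)  # middle value maps to about 0.716
```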

Normalization Example
▪ Given one-dimensional data set X = {-5.0, 23.0, 17.6, 7.23, 1.11}, normalize the
data set using
▪ Decimal scaling.
▪ Min-max normalization.
▪ Standard deviation normalization.
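One possible worked solution in Python (assuming the sample standard deviation for the last variant):

```python
from statistics import mean, stdev

X = [-5.0, 23.0, 17.6, 7.23, 1.11]

# Decimal scaling: max |x| = 23.0, so k = 2 (divide by 10^2)
dec = [x / 100 for x in X]

# Min-max: min(X) = -5.0, max(X) = 23.0
mm = [(x - min(X)) / (max(X) - min(X)) for x in X]

# Standard deviation (z-score) normalization
m, s = mean(X), stdev(X)
z = [(x - m) / s for x in X]
```

After min-max normalization the smallest value maps to 0.0 and the largest to 1.0; after z-score normalization the values have mean 0.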

Data transformation: Discretization
▪ Discretization, also called binning, converts numeric attributes into categorical
ones.
▪ It is usually applied for data mining methods that cannot handle numeric
attributes.
▪ It can also help in reducing the number of values for an attribute, especially if
there is noise in the numeric measurements; discretization allows one to ignore
small and irrelevant differences in the values.

Methods of discretization
▪ Equal-Width Intervals
▪ The simplest binning approach is to partition the range of
X into k equal-width intervals.
▪ Width of Bin = (Max–Min)/k

▪ Equal-Frequency Intervals
▪ In equal-frequency binning we divide the range of X into
intervals that contain (approximately) equal number of
points;

▪ For both methods, the best way of determining k is to look at the histogram and try different intervals or groupings.
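Both binning schemes can be sketched as follows (function names are my own; bins are labelled 0..k-1):

```python
def equal_width_bins(values, k):
    """Assign each value a bin index using k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_freq_bins(values, k):
    """Assign bin indices so each bin holds ~len(values)/k points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins
```

On skewed data the two behave very differently: equal-width bins may leave most points in a single bin, while equal-frequency bins balance the counts by construction.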

Data Reduction
▪ The goal of data reduction is to represent the data more compactly. The three basic
operations in a data-reduction process are:
▪ Delete a row
▪ Delete a column (dimensionality reduction)
▪ Reduce the number of values in a column (smooth a feature)

▪ Data reduction does result in some loss of information.
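A toy illustration of the three operations (the table values are invented):

```python
import random

# Toy table: rows of (age, income, customer_id)
rows = [(25, 31.0, 1), (32, 54.0, 2), (47, 78.0, 3), (51, 82.0, 4)]

# Delete rows: keep a random sample of the tuples
random.seed(0)
sample = random.sample(rows, 2)

# Delete a column (dimensionality reduction): drop customer_id
no_id = [(age, income) for age, income, _ in rows]

# Reduce the number of values in a column: smooth income
# by rounding to the nearest 10, leaving fewer distinct values
smoothed = [(age, round(income, -1)) for age, income in no_id]
```

Each step discards some information (which rows were dropped, the exact incomes), which is exactly the trade-off the slide notes.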

Data Reduction (cont’d)
▪ The main advantages of data reduction are
▪ Computing time – simpler data can hopefully lead to a reduction in the time taken for
data mining.
▪ Predictive/descriptive accuracy – We generally expect that, by using only relevant
features, a data mining algorithm can learn not only faster but also with higher accuracy.
Irrelevant data may mislead a learning process.
▪ Representation of the data-mining model – The simplicity of representation often
implies that a model can be better understood.

Feature Extraction
▪ An important phase of the data mining process is creating a set of features that the analyst
can work with.
▪ It is important to remove redundant features, which duplicate much or all of the information
contained in one or more other attributes
▪ Example: the purchase price of a product and the amount of sales tax paid

▪ It is also important to remove irrelevant features, which contain no information useful for
the data mining task at hand
▪ Example: a student's ID is often irrelevant to the task of predicting the student's GPA
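One simple way to spot a redundant numeric feature like the sales-tax example is a correlation check (a sketch; the prices and the 8% tax rate are invented):

```python
from statistics import mean

def pearson_corr(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

price = [100.0, 250.0, 400.0, 80.0]
tax = [p * 0.08 for p in price]  # tax is a fixed fraction of price

# |correlation| close to 1 flags one of the two features as redundant
r = pearson_corr(price, tax)
```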

KNIME Demo
