
Data Scaling

Why do we need to scale data?

● Input variables may have different units (e.g. feet, kilometers, and hours) that, in turn, may
mean the variables have different scales
● Differences in the scales across input variables may increase the difficulty of the problem
being modeled
● Whether input variables require scaling depends on the specifics of your problem and of
each variable
● Problems can be complex, and it may not be clear how best to scale the input data
Types of Scaling Methods

● Normalization:
○ Rescaling of the data from its original range so that all values fall between 0 and 1
○ It is also known as min-max scaling (implemented in scikit-learn as MinMaxScaler)

● Standardization:
○ Standardizing a dataset involves rescaling the distribution of values so that the mean of observed
values is 0 and the standard deviation is 1
○ Standardization assumes that your observations fit a Gaussian (normal) distribution; results are most meaningful when this roughly holds
Normalization

● Normalization requires that you know or are able to accurately estimate the minimum
and maximum observable values
● Normalization Formula:
○ y = (x - min) / (max - min)
● Generally, we import MinMaxScaler from sklearn.preprocessing for Normalization
● By default, MinMaxScaler rescales variables into the range [0, 1], although a different
range can be specified via its feature_range argument (see the sketch after this list)
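
Below is a minimal sketch of Normalization with scikit-learn's MinMaxScaler; the small NumPy array is hypothetical and chosen purely for illustration.

    # Minimal sketch: normalization (min-max scaling) with scikit-learn
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Hypothetical single-column data, used only for illustration
    data = np.array([[100.0], [20.0], [60.0], [40.0]])

    scaler = MinMaxScaler()              # default feature_range is (0, 1)
    normalized = scaler.fit_transform(data)
    print(normalized)                    # every value now lies between 0 and 1

    # A different output range can be requested, e.g. [-1, 1]
    scaler_custom = MinMaxScaler(feature_range=(-1, 1))
    print(scaler_custom.fit_transform(data))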
Standardization

● Standardization requires that you know or are able to accurately estimate the mean and
standard deviation of observable values
● Standardization Formula:
○ y = (x - mean) / standard_deviation
● Generally, we import StandardScaler from sklearn.preprocessing for Standardization
● This can be thought of as centering the data (subtracting the mean) and then scaling it
by the standard deviation (see the sketch below)
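
Below is a minimal sketch of Standardization with scikit-learn's StandardScaler; the example values are hypothetical and chosen purely for illustration.

    # Minimal sketch: standardization (z-score scaling) with scikit-learn
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Hypothetical single-column data, used only for illustration
    data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

    scaler = StandardScaler()
    standardized = scaler.fit_transform(data)

    print(scaler.mean_)    # mean estimated from the data
    print(scaler.scale_)   # standard deviation estimated from the data
    print(standardized)    # column now has mean ~0 and standard deviation ~1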
