Once your data is all in numerical format, there's one more transformation you'll
probably want to do to it.
It's called feature scaling: making sure all of your numerical data is on the same
scale.
For example, say you were trying to predict the sale price of cars and the number
of kilometres on their odometers varies from 6,000 to 345,000 but the median
previous repair cost varies from 100 to 1,700. A machine learning algorithm may
have trouble finding patterns when its input variables span such different ranges.
To fix this, there are two main types of feature scaling.
Normalization (also called min-max scaling) - This rescales all the numerical
values to between 0 and 1, with the lowest value mapped to 0 and the highest
value mapped to 1. Scikit-Learn provides functionality for this in the
MinMaxScaler class.
Standardization - This subtracts the mean value from each feature (so the
resulting features have 0 mean). It then scales each feature to unit variance (by
dividing the feature by its standard deviation). Scikit-Learn provides
functionality for this in the StandardScaler class.
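To make the two formulas concrete, here is a minimal sketch in plain Python with no Scikit-Learn dependency. The helper names min_max_scale and standardize, and the sample odometer and repair-cost values, are hypothetical and chosen only to match the ranges in the car example above. Mapping a constant feature to 0 is also an assumption of this sketch.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant feature has no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def standardize(values):
    """Subtract the mean, then divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant feature has no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample values spanning the ranges from the car example.
odometer_km = [6_000, 58_000, 120_000, 345_000]
repair_cost = [100, 450, 900, 1_700]

odometer_minmax = min_max_scale(odometer_km)
repair_minmax = min_max_scale(repair_cost)
odometer_std = standardize(odometer_km)
repair_std = standardize(repair_cost)

print("odometer, min-max scaled: ", odometer_minmax)
print("repair cost, min-max scaled:", repair_minmax)
print("odometer, standardized:    ", odometer_std)
print("repair cost, standardized: ", repair_std)
```

After either transformation the two features occupy comparable ranges, even though the raw odometer values are hundreds of times larger than the repair costs. In Scikit-Learn, MinMaxScaler and StandardScaler apply these same formulas to each column of a 2-D array through their fit_transform method, and StandardScaler also uses the population standard deviation.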
A couple of things to note.
For further information on this topic, I'd suggest the following resources.