
Data Preprocessing

Detecting Missing Values


Why is Data Preprocessing Important?
• The main objective of this step is to check and ensure the quality of
data before applying any Machine Learning or Data Mining
methods. Let’s review some of its benefits –
• Accuracy - Data Preprocessing ensures that input data is accurate and
reliable, with no manual entry errors, no duplicates, etc.
• Completeness - It ensures that missing values are handled and the data is
complete for further analysis.
• Consistency - Data Preprocessing ensures that input data is consistent, i.e.,
the same data kept in different places should match.
• Timeliness - Whether data is updated regularly and on a timely basis.
• Trustworthiness - Whether data comes from trustworthy sources.
• Interpretability - Raw data is generally unusable; Data
Preprocessing converts raw data into an interpretable format.
Data Cleaning
• Data cleaning means fixing bad data in your data
set.
• Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
• The biggest data cleaning task is handling missing values.
Sources of Missing Values

• User forgot to fill in a field.


• Data was lost while transferring manually from a
legacy database.
• There was a programming error.
• Users chose not to fill out a field tied to their beliefs
about how the results would be used or
interpreted.
Data Cleaning: Handling Missing Values

• Input data can contain missing or NULL values,


which must be handled before applying any
Machine Learning or Data Mining techniques.
• Missing values can be handled by many
techniques, such as
• removing rows/columns containing NULL values and
• imputing NULL values using mean, mode, regression,
etc.
Data Cleaning: Missing Values
• Before you start cleaning a data set, it’s a good idea to get a
general feel for the data. After that, you can put together a plan to
clean the data.
• Do I have missing values? How are they expressed in the data? Should I withhold samples with
missing values? Or should I replace them? If so, which values should they be replaced with?

• I like to start by asking the following questions:

• What are the features?
• What are the expected types (int, float, string, boolean)?
• Is there obvious missing data (values that Pandas can detect)?
• Are there other types of missing data that are not so obvious (can’t
easily be detected with Pandas)?
Sample Data Set
property data.csv

PID        ST_NUM  ST_NAME      OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH
100001000  104     ANGALLU      Y             3             1
100002000  197     MADANAPALLE  N             3             1.5
100003000          MADANAPALLE  N             n/a           1
100004000  201     TEMPLE       12            1             NaN
           203     TEMPLE       Y             3             2
100006000  207     TEMPLE       Y             NA            1
100007000  NA      KOTAKOTTA                  2             HURLEY
100008000  213     DNR          Y             --            1
100009000  215     DNR          Y             na            2
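As a minimal sketch, the slide's sample table can be recreated as an in-memory CSV so the examples that follow are reproducible without the actual "property data.csv" file:

```python
import io
import pandas as pd

# Recreate the slide's sample "property data.csv" as an in-memory file
csv_data = """PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH
100001000,104,ANGALLU,Y,3,1
100002000,197,MADANAPALLE,N,3,1.5
100003000,,MADANAPALLE,N,n/a,1
100004000,201,TEMPLE,12,1,NaN
,203,TEMPLE,Y,3,2
100006000,207,TEMPLE,Y,NA,1
100007000,NA,KOTAKOTTA,,2,HURLEY
100008000,213,DNR,Y,--,1
100009000,215,DNR,Y,na,2
"""

df = pd.read_csv(io.StringIO(csv_data))
print(df.head())
```

In practice you would call `pd.read_csv("property data.csv")` directly on the file.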
What are the features?
• ST_NUM: Street number
• ST_NAME: Street name
• OWN_OCCUPIED: Is the residence owner occupied?
• NUM_BEDROOMS: Number of bedrooms
• NUM_BATH: Number of bathrooms

What are the expected types?

• ST_NUM: float or int… some sort of numeric type
• ST_NAME: string
• OWN_OCCUPIED: string… Y (“Yes”) or N (“No”)
• NUM_BEDROOMS: float or int, a numeric type
• NUM_BATH: float or int, a numeric type
• What types does info() actually report?
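To compare the expected types with what Pandas actually infers, a quick sketch (using the slide's sample data, recreated in-memory) is to call info() or inspect dtypes; columns that should be numeric but contain stray strings will show up as object:

```python
import io
import pandas as pd

csv_data = """PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH
100001000,104,ANGALLU,Y,3,1
100002000,197,MADANAPALLE,N,3,1.5
100003000,,MADANAPALLE,N,n/a,1
100004000,201,TEMPLE,12,1,NaN
,203,TEMPLE,Y,3,2
100006000,207,TEMPLE,Y,NA,1
100007000,NA,KOTAKOTTA,,2,HURLEY
100008000,213,DNR,Y,--,1
100009000,215,DNR,Y,na,2
"""
df = pd.read_csv(io.StringIO(csv_data))

df.info()          # column names, non-null counts, and inferred dtypes
print(df.dtypes)   # ST_NUM is float64, but NUM_BEDROOMS/NUM_BATH are object
```

NUM_BEDROOMS and NUM_BATH come back as object because entries like "--", "na", and "HURLEY" prevent Pandas from parsing them as numbers.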
Standard Missing Values
• These are missing values that Pandas can detect.
• Pandas will recognize both empty cells and “NA”
values as missing values.
• Ex. let’s take a look at the “ST_NUM” column in
our dataset: 104, 197, (empty), 201, 203, 207, NA, 213, 215.
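A minimal sketch of this detection, with the ST_NUM column as Pandas parses it (the empty cell and "NA" both become NaN):

```python
import numpy as np
import pandas as pd

# ST_NUM as parsed: the empty cell and "NA" are both read as NaN
st_num = pd.Series([104, 197, np.nan, 201, 203, 207, np.nan, 213, 215])

print(st_num.isnull())        # True at the two missing positions
print(st_num.isnull().sum())  # 2 standard missing values detected
```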
Non-Standard Missing Values
• Sometimes there are missing values that have different formats.
• Let’s take a look at the “Number of Bedrooms”
column to see what I mean: in this column,
there are four missing values.
• If there are multiple users manually entering data, then this is a common
problem. Maybe I like to use “n/a” but you like to use “na”.
Non-Standard Missing Values
• An easy way to detect these various formats is to put them in a list.
• Then when we import the data, Pandas will recognize them right
away.
• Here’s an example of how we would do that.
• All of the different formats are then recognized as missing values.
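A sketch of this approach: pass the list of non-standard markers to read_csv via its na_values parameter (the column values below are taken from the slide's sample data):

```python
import io
import pandas as pd

# Extra "missing" markers seen in this data set
missing_values = ["n/a", "na", "--"]

csv_data = """NUM_BEDROOMS
3
3
n/a
1
3
NA
2
--
na
"""

df = pd.read_csv(io.StringIO(csv_data), na_values=missing_values)
print(df["NUM_BEDROOMS"].isnull().sum())  # 4: n/a, NA, --, and na
```

"NA" is already in Pandas' default NA list; the na_values argument extends that list rather than replacing it.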


Non-Standard Missing Values

• It’s important to recognize these non-standard


types of missing values for purposes of
summarizing and transforming missing values.

• If you try and count the number of missing values


before converting these non-standard types, you
could end up missing a lot of missing values.
Unexpected Missing Values
• if our feature is expected to be a string, but there’s a numeric type,
then technically this is also a missing value.
• Fourth row, there’s the number 12. The response for Owner Occupied
should clearly be a string (Y or N), so this numeric type should be a
missing value.
Unexpected Missing Values
• To detect these types of missing values, we use the following steps:
1. Loop through the column, i.e., OWN_OCCUPIED
2. Try to turn the entry into an integer
3. If the entry can be changed into an integer, enter a missing value
4. If the entry can’t be an integer, we know it’s a string, so keep
going
Unexpected Missing Values
• In the code we’re looping through each entry in the “Owner
Occupied” column.
• To try and change the entry to an integer, we’re using int(row).
• If the value can be changed to an integer, we change the entry to
a missing value using Numpy’s np.nan.
• On the other hand, if it can’t be changed to an integer, we pass
and keep going.
• You’ll notice that I used try and except ValueError. This is called
exception handling, and we use this to handle errors.
• If we were to try and change an entry into an integer and it
couldn’t be changed, then a ValueError would be returned, and
the code would stop. To deal with this, we use exception handling
to recognize these errors, and keep going.
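The loop described above can be sketched as follows (the OWN_OCCUPIED values are recreated from the slide's sample data):

```python
import numpy as np
import pandas as pd

# OWN_OCCUPIED as in the sample data: "12" is a numeric entry
# where only Y/N strings are valid, and one cell is already empty
df = pd.DataFrame(
    {"OWN_OCCUPIED": ["Y", "N", "N", "12", "Y", "Y", np.nan, "Y", "Y"]}
)

cnt = 0
for row in df["OWN_OCCUPIED"]:
    try:
        int(row)                              # can it be read as an integer?
        df.loc[cnt, "OWN_OCCUPIED"] = np.nan  # yes: mark it as missing
    except ValueError:
        pass                                  # no: it's a string, keep going
    cnt += 1

print(df["OWN_OCCUPIED"].isnull().sum())  # the original NaN plus the "12"
```

Note that int() raises ValueError both for strings like "Y" and for NaN, so the except branch safely skips entries that are already missing.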
Summarizing Missing Values
• After we’ve cleaned the missing values, we will probably
want to summarize them. For instance, we might want to
look at the total number of missing values for each feature.
• df.isnull().values.any() — to see if we have any missing values at all.
• df.isnull().sum() — to get a per-feature count, and
df.isnull().sum().sum() — to get a total count of missing values.
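A minimal sketch of these summaries on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ST_NUM": [104, 197, np.nan, 201],
    "NUM_BEDROOMS": [3, 3, np.nan, 1],
})

print(df.isnull().sum())         # missing count per feature
print(df.isnull().values.any())  # True if any value at all is missing
print(df.isnull().sum().sum())   # total number of missing values
```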
Remove missing values
DataFrame.dropna(*, axis=0, how=_NoDefault.no_default,
thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Determine if rows or columns which contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing values.
Only a single axis is allowed.
how : {‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at
least one NA or all NA.
‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values. Cannot be combined with how.
inplace : bool, default False
Whether to modify the DataFrame rather than creating a new one.
Returns:
DataFrame or None : DataFrame with NA entries dropped from it or None if inplace=True.
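A sketch of how these parameters interact, on a small hypothetical DataFrame whose column C is entirely NA:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [4.0, 5.0, 6.0, np.nan],
    "C": [np.nan, np.nan, np.nan, np.nan],  # all-NA column
})

rows_any = df.dropna()              # drop rows with any NA (defaults)
cols_any = df.dropna(axis=1)        # drop columns with any NA
rows_all = df.dropna(how="all")     # drop only rows where every value is NA
rows_thresh = df.dropna(thresh=2)   # keep rows with at least 2 non-NA values

print(rows_any.shape)     # (0, 3): every row touches the all-NA column C
print(rows_all.shape)     # (3, 3): only the fully-empty last row is dropped
print(rows_thresh.shape)  # (2, 3)
```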
Replacing
• Using the following functions, we may fill in any null values in a dataset
by replacing NaN values with alternative values:
• fillna()
• bfill()
• ffill()
• replace()
• interpolate()
pandas.DataFrame.fillna
Fill NA/NaN values using the specified method.

DataFrame.fillna(value=None, *, axis=None, inplace=False,
limit=None, downcast=_NoDefault.no_default)

value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame
of values specifying which value to use for each index (for a Series) or
column (for a DataFrame).
Values not in the dict/Series/DataFrame will not be filled.
This value cannot be a list.

values = {"A": 0, "B": 1, "C": 2, "D": 3}
df.fillna(value=values)
Replacing
• Often we want to fill in missing values with a single value.
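A minimal sketch of single-value filling (the column and the fill value 125 echo the slide's sample data, but any value works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ST_NUM": [104, np.nan, 201, np.nan]})

# Fill every missing value in the column with a single value
df["ST_NUM"] = df["ST_NUM"].fillna(125)
print(df["ST_NUM"].tolist())  # [104.0, 125.0, 201.0, 125.0]
```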
pandas.DataFrame.ffill
Fill NA/NaN values by propagating the last valid observation to the next valid.

DataFrame.ffill(*, axis=None, inplace=False, limit=None,
downcast=_NoDefault.no_default)
Parameters:
axis : {0 or ‘index’} for Series, {0 or ‘index’, 1 or ‘columns’} for DataFrame
Axis along which to fill missing values. For Series this parameter is unused and
defaults to 0.
inplace : bool, default False
If True, fill in-place. Note: this will modify any other views on this object (e.g., a
no-copy slice for a column in a DataFrame).
limit : int, default None
• If method is specified, this is the maximum number of consecutive NaN
values to forward/backward fill.
• In other words, if there is a gap with more than this number of consecutive
NaNs, it will only be partially filled.
• If method is not specified, this is the maximum number of entries along the
entire axis where NaNs will be filled. Must be greater than 0 if not None.
pandas.DataFrame.ffill
DataFrame.ffill(*, axis=None, inplace=False, limit=None,
downcast=_NoDefault.no_default)

Parameters:
downcast : dict, default is None
A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which
will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

Returns:
Series/DataFrame or None
Object with missing values filled or None if inplace=True.
pandas.DataFrame.bfill
Fill NA/NaN values by using the next valid observation to fill the gap.

DataFrame.bfill(*, axis=None, inplace=False, limit=None,
downcast=_NoDefault.no_default)
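A sketch contrasting forward fill, backward fill, and the limit parameter on a small hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # propagate the last valid observation forward
backward = s.bfill()         # pull the next valid observation backward
limited = s.ffill(limit=1)   # fill at most 1 consecutive NaN per gap

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan] - no later value exists
```

With limit=1, the two-NaN gap is only partially filled: the first NaN gets 1.0 and the second stays missing.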
Filling the Missing Values –
Imputation

The possible ways to do this are:

1. Filling the missing data with the mean or median value if it’s a numerical
variable.
2. Filling the missing data with the mode if it’s a categorical variable.
3. Filling the numerical value with 0 or -999, or some other number that will
not occur in the data. This can be done so that the model can recognize
that the data is not real or is different.
4. Filling the categorical value with a new type for the missing values.
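The first two strategies above can be sketched as follows, on small hypothetical columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "NUM_BEDROOMS": [3.0, 3.0, np.nan, 1.0, 3.0],  # numerical
    "OWN_OCCUPIED": ["Y", "N", None, "Y", "Y"],    # categorical
})

# 1. Numerical: fill with the mean (or use .median() instead)
df["NUM_BEDROOMS"] = df["NUM_BEDROOMS"].fillna(df["NUM_BEDROOMS"].mean())

# 2. Categorical: fill with the mode (most frequent value)
df["OWN_OCCUPIED"] = df["OWN_OCCUPIED"].fillna(df["OWN_OCCUPIED"].mode()[0])

# 3./4. Alternatively: a sentinel number or a new category, e.g.
# df["NUM_BEDROOMS"].fillna(-999) or df["OWN_OCCUPIED"].fillna("Missing")
print(df)
```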
Replacing
• Sometimes you might want to do a location-based imputation. Here’s how
you would do that:
df.loc[2,'ST_NUM'] = 125

• A very common way to replace missing values is using the median.
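Both ideas can be sketched together on hypothetical data: set one specific cell with .loc, then fill a whole column with its median:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ST_NUM": [104.0, 197.0, np.nan, 201.0],
    "NUM_BEDROOMS": [3.0, 3.0, np.nan, 1.0],
})

# Location-based imputation: set a specific cell by row label and column
df.loc[2, "ST_NUM"] = 125

# Median-based replacement for a whole column
median = df["NUM_BEDROOMS"].median()
df["NUM_BEDROOMS"] = df["NUM_BEDROOMS"].fillna(median)

print(df)
```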
pandas.DataFrame.replace()
DataFrame.replace(to_replace=None, value=_NoDefault.no_default, *,
inplace=False, limit=None, regex=False, method=_NoDefault.no_default)

• Replace values given in to_replace with value.
• Values of the Series/DataFrame are replaced with other values
dynamically.

to_replace : str, regex, list, dict, Series, int, float, or None
How to find the values that will be replaced. numeric, str or regex:
• numeric: numeric values equal to to_replace will be replaced with value
• str: strings exactly matching to_replace will be replaced with value
• regex: regexes matching to_replace will be replaced with value
pandas.DataFrame.replace()
DataFrame.replace(to_replace=None, value=_NoDefault.no_default, *,
inplace=False, limit=None, regex=False, method=_NoDefault.no_default)

to_replace: str, regex, list, dict, Series, int, float, or None


dict:
• Different replacement values for different existing values.
• To use a dict in this way, the optional value parameter should not be given.
• For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’.
pandas.DataFrame.replace()
DataFrame.replace(to_replace=None, value=_NoDefault.no_default, *,
inplace=False, limit=None, regex=False, method=_NoDefault.no_default)

to_replace: str, regex, list, dict, Series, int, float, or None


dict:
• Different values replaced in different columns.
• For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and
replaces these values with whatever is specified in value.
• The value parameter should not be None in this case.
pandas.DataFrame.replace()
DataFrame.replace(to_replace=None, value=_NoDefault.no_default, *,
inplace=False, limit=None, regex=False, method=_NoDefault.no_default)

to_replace: str, regex, list, dict, Series, int, float, or None


dict:
• For a DataFrame, nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows:
• look in column ‘a’ for the value ‘b’ and replace it with NaN.
• The optional value parameter should not be specified to use a nested dict in this way.

• We can nest regular expressions as well. Note that column names (the top-level dictionary keys in a
nested dictionary) cannot be regular expressions.
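The three dict forms above can be sketched on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", 1], "b": ["y", "z", "na"]})

# Plain dict: replace 'y' with 'z' everywhere (value must not be given)
r1 = df.replace({"y": "z"})

# Column dict: find 1 in column 'a' and 'z' in column 'b',
# replace both with the value argument
r2 = df.replace({"a": 1, "b": "z"}, value=100)

# Nested dict: look only in column 'b' for 'na' and replace it with NaN
r3 = df.replace({"b": {"na": np.nan}})

print(r3)
```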
interpolate() function
• Pandas dataframe.interpolate() function is basically
used to fill NA values in the dataframe or series.

• It is a very powerful function for filling
missing values.

• It uses various interpolation techniques to fill the
missing values rather than hard-coding a value.
pandas.DataFrame.interpolate()
Fill NaN values using an interpolation method.

DataFrame.interpolate(method='linear', *, axis=0, limit=None, inplace=False,
limit_direction=None, limit_area=None, downcast=_NoDefault.no_default, **kwargs)

Parameters :
method : {‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’,
‘barycentric’, ‘krogh’, ‘polynomial’, ‘spline’, ‘piecewise_polynomial’, ‘from_derivatives’,
‘pchip’, ‘akima’}
axis : 0 to fill column-by-column and 1 to fill row-by-row.
limit : Maximum number of consecutive NaNs to fill. Must be greater than 0.
limit_direction : {‘forward’, ‘backward’, ‘both’}, default ‘forward’. If limit is
specified, consecutive NaNs will be filled in this direction.
limit_area : None (default) - no fill restriction. ‘inside’ - only fill NaNs surrounded by valid
values (interpolate). ‘outside’ - only fill NaNs outside valid values (extrapolate).
inplace : Update the DataFrame in place if possible.
downcast : Downcast dtypes if possible.
kwargs : Keyword arguments to pass on to the interpolating function.

Returns : Series or DataFrame of the same shape interpolated at the NaNs
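A minimal sketch of linear interpolation on a hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# Default linear interpolation: each missing value falls on the straight
# line between the surrounding valid observations
linear = s.interpolate()
print(linear.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# limit restricts how many consecutive NaNs get filled per gap
partial = s.interpolate(limit=1)
```

With limit=1, only the first NaN of the two-NaN gap is filled, so one missing value remains.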
