
NAME- Avinash Tiwari

ROLL NO.- 2100290110041


DATA ANALYTICS LAB 6
EXP 6: To perform data pre-processing operations: 1) Handling missing data 2)
Min-Max normalization

CODE:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Example DataFrame
data = {'Salary': [50000, np.nan, 60000, np.nan, 70000]}
df = pd.DataFrame(data)

# Handling Missing Data

# Method 1: Fill missing values using the previous valid value

# (ffill() is the current spelling of fillna(method='pad'), which is deprecated in recent pandas)
df['Salary'] = df['Salary'].ffill()

# Method 2: Replace NaN values with 0

# df['Salary'] = df['Salary'].replace(to_replace=np.nan, value=0)

# Method 3: Interpolate missing values linearly

# df['Salary'] = df['Salary'].interpolate(method='linear', limit_direction='forward')

# Min-Max Normalization

trans = MinMaxScaler()

df['Salary_normalized'] = trans.fit_transform(df[['Salary']])

print(df)

THEORY:

Handling Missing Data:

The code handles missing data in the 'Salary' column by forward filling (the 'pad'
method): each missing value is replaced with the last valid value before it. To use another
method, such as replacing NaN with 0 or linear interpolation, comment out the forward-fill
line and uncomment the corresponding line.
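
To make the difference between the three methods concrete, the sketch below applies each one to the same example column and prints the results side by side. The variable names (salary, forward_filled, zero_filled, interpolated) are illustrative and not part of the lab code.

```python
import numpy as np
import pandas as pd

# Same example values as the lab's 'Salary' column
salary = pd.Series([50000, np.nan, 60000, np.nan, 70000], name="Salary")

# Method 1: forward fill (each NaN takes the previous valid value)
forward_filled = salary.ffill()

# Method 2: replace NaN with 0
zero_filled = salary.fillna(0)

# Method 3: linear interpolation between the neighbouring valid values
interpolated = salary.interpolate(method="linear")

comparison = pd.DataFrame({
    "original": salary,
    "forward_fill": forward_filled,
    "zero_fill": zero_filled,
    "interpolated": interpolated,
})
print(comparison)
```

For this data the two gaps become 50000 and 60000 with forward fill, 0 and 0 with zero fill, and 55000 and 65000 with linear interpolation. Zero fill in particular changes the column minimum, which affects any Min-Max normalization applied afterwards.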

Missing data can occur due to various reasons such as data entry errors, equipment
malfunction, or intentional omission.

It is essential to handle missing data appropriately, because leaving it unaddressed can
lead to biased results and incorrect conclusions.

Common strategies for handling missing data include:

Imputation: Filling in missing values with estimated values. Methods include filling with
mean, median, mode, or using more sophisticated techniques like linear interpolation.

Deletion: Removing rows or columns with missing values. However, this approach can
lead to loss of valuable information if not done carefully.

Prediction: Using machine learning algorithms to predict missing values based on other
features in the dataset.
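
The three strategies above can be sketched on a small table as follows. This is only an illustration under stated assumptions: the 'Experience' column is a hypothetical extra feature that does not appear in the lab data, and a plain linear regression stands in for "machine learning algorithms" in the prediction step.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# 'Experience' is an assumed feature added only for this illustration
df = pd.DataFrame({
    "Experience": [1, 2, 3, 4, 5],
    "Salary": [50000, np.nan, 60000, np.nan, 70000],
})

# Imputation: fill gaps with a summary statistic of the observed values
mean_imputed = df["Salary"].fillna(df["Salary"].mean())
median_imputed = df["Salary"].fillna(df["Salary"].median())

# Deletion: drop every row whose 'Salary' is missing
rows_dropped = df.dropna(subset=["Salary"])

# Prediction: fit a model on the complete rows, then predict the gaps
known = df[df["Salary"].notna()]
missing = df[df["Salary"].isna()]
model = LinearRegression().fit(known[["Experience"]], known["Salary"])
predicted = df["Salary"].copy()
predicted.loc[missing.index] = model.predict(missing[["Experience"]])

print(mean_imputed.tolist())
print(len(rows_dropped), "rows remain after deletion")
print(predicted.tolist())
```

Deletion here keeps only three of the five rows, which shows how quickly information is lost on small datasets.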

In Python, libraries like pandas provide convenient functions like fillna() and interpolate()
for handling missing data effectively.

Min-Max Normalization:

For Min-Max normalization, the code creates a MinMaxScaler object, fits it to the 'Salary'
column and transforms the data in one step with fit_transform(), and stores the normalized
values in a new column called 'Salary_normalized'.

Min-Max normalization is a feature scaling technique that rescales numeric features to a
specific range, typically between 0 and 1.
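
The target range is configurable. As a small sketch, the feature_range parameter of scikit-learn's MinMaxScaler selects a range other than the default 0 to 1; the sample values and variable names below are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature (column), three samples (rows)
salaries = np.array([[50000.0], [60000.0], [70000.0]])

# Default range is (0, 1)
unit_scaled = MinMaxScaler().fit_transform(salaries)

# Rescale to (-1, 1) instead
symmetric_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(salaries)

print(unit_scaled.ravel())       # values in [0, 1]
print(symmetric_scaled.ravel())  # values in [-1, 1]
```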

The formula for Min-Max normalization is:

X_{\text{normalized}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}

Here, X is the original value, X_{\text{min}} is the minimum value of the feature, and
X_{\text{max}} is the maximum value of the feature.
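
As a worked check of the formula, the sketch below applies it by hand to the forward-filled salary values and compares the result with MinMaxScaler. The variable names are illustrative, and the sketch assumes the column is not constant (otherwise the denominator would be zero).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# 'Salary' column as it looks after forward filling the two gaps
salary = pd.Series([50000.0, 50000.0, 60000.0, 60000.0, 70000.0], name="Salary")

# Apply the formula directly: (X - X_min) / (X_max - X_min)
x_min, x_max = salary.min(), salary.max()
manual = (salary - x_min) / (x_max - x_min)

# Same computation through scikit-learn (expects a 2-D input, hence to_frame())
scaler_output = MinMaxScaler().fit_transform(salary.to_frame()).ravel()

print(manual.tolist())
print(np.allclose(manual, scaler_output))
```

With X_min = 50000 and X_max = 70000, a salary of 60000 maps to (60000 - 50000) / (70000 - 50000) = 0.5, while the minimum and maximum map to 0 and 1.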

Min-Max normalization ensures that all features have the same scale, which can be crucial
for algorithms sensitive to feature scales, such as gradient descent-based optimization
algorithms.
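
To illustrate what "the same scale" means, the sketch below scales two features whose raw magnitudes differ by three orders of magnitude. The 'Age' column and its values are hypothetical and were added only for this example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "Salary": [50000.0, 60000.0, 70000.0],
    "Age": [25.0, 35.0, 45.0],  # hypothetical second feature
})

# MinMaxScaler scales each column independently using that column's min and max
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

print(df.max() - df.min())          # raw ranges: 20000 vs 20
print(scaled.max() - scaled.min())  # both ranges are 1 after scaling
```

After scaling, neither feature dominates simply because of the units it is measured in.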

In Python, libraries like scikit-learn provide the MinMaxScaler class, which makes it easy
to perform Min-Max normalization on datasets.

By understanding and applying these data pre-processing techniques in Python, you
can ensure that your data is clean, standardized, and suitable for further analysis or
machine learning tasks.
