
ITC2252 - Introduction to Machine Learning
Practical Session - 03
Steps of the process
01 Import Data
02 Clean the Data
03 Split the data into Training & Testing sets
04 Design the model
05 Train the Model
06 Make Predictions
07 Evaluate and Improve
01
Import the dataset
Key Uses of Importing Datasets

➢ Data Exploration and Preprocessing: Importing datasets allows you to
understand and clean the data.
➢ Model Training and Evaluation: Imported datasets serve as the training
data for building and training machine learning models.
➢ Feature Engineering: Importing datasets enables the creation of
features to enhance model performance.
➢ Model Deployment and Inference: The imported dataset helps deploy
trained models for making predictions on new, unseen data.
How to import a dataset into a notebook file?

1. Jupyter notebook
➢ Ensure that you have the dataset file available on your local machine.
➢ Use the Python Pandas library to import the dataset into the notebook.

import pandas as pd

# Read CSV file into a DataFrame
dataframe = pd.read_csv('dataset.csv')

import pandas as pd

# Read Excel file into a DataFrame
dataframe = pd.read_excel('dataset.xlsx')
2. Kaggle
In Kaggle, many datasets are readily available and can be imported into
your notebooks in either of the following ways.
1. Link a kaggle dataset directly to kaggle notebook
2. Upload a dataset from your local computer to kaggle notebook
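Once a dataset is linked or uploaded, its files can be located from inside the notebook. The sketch below assumes Kaggle's usual read-only input folder, /kaggle/input; the helper name list_input_files is hypothetical and only used for illustration.

```python
import os


def list_input_files(root="/kaggle/input"):
    """Return the full paths of all files found under `root`.

    `root` defaults to the folder where Kaggle normally exposes
    attached datasets (an assumption; adjust it if yours differs).
    """
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in sorted(filenames):
            paths.append(os.path.join(dirname, filename))
    return paths


# Print every attached file so its path can be passed to pd.read_csv().
# Outside Kaggle the folder does not exist, so nothing is printed.
for path in list_input_files():
    print(path)
```

Any printed path can then be passed to pd.read_csv() in the same way as a local file.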
3. Google Colab
➢ For Google Colab, you can upload a dataset directly from your Google
Drive.
➢ You can mount your Google Drive in the notebook and then access the
dataset file by its path.
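A minimal sketch of the mounting step, assuming the file is called dataset.csv and sits at the top level of your Drive. The helper drive_path is hypothetical; google.colab is only available inside Colab, so the mount is skipped elsewhere.

```python
MOUNT_POINT = "/content/drive"  # conventional mount point in Colab


def drive_path(relative_path, mount_point=MOUNT_POINT):
    """Build the full path of a file stored under 'MyDrive'."""
    return f"{mount_point}/MyDrive/{relative_path}"


try:
    from google.colab import drive  # exists only inside Google Colab
except ImportError:
    drive = None

if drive is not None:
    import pandas as pd

    drive.mount(MOUNT_POINT)  # asks for permission to access your Drive
    # 'dataset.csv' is an assumed file name; replace it with your own.
    dataframe = pd.read_csv(drive_path("dataset.csv"))
else:
    print("Not running in Google Colab; skipping the mount step.")
```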
02
Clean the data in the dataset
Importance of Data cleaning
● Data cleaning is the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data
within a dataset.
● This makes it possible to extract accurate, defensible data from the dataset,
which in turn produces reliable visualizations, models, and business decisions.
Visit the link below to get the code:

https://github.com/SanduNihara/MLProject.git
Methods of Data cleaning
1. Identify all missing / NaN values in a dataset

★ missing_value: A list that includes missing value types like "N/A", "na", and np.nan (which
represents the NaN value from the NumPy library).
★ df = pd.read_csv("..............", na_values=missing_value): This call reads the CSV file
located at the given path into a DataFrame named df. The na_values parameter
specifies which values in the CSV file should be treated as missing. Here, any
"N/A", "na", or np.nan entry in the file will be recognized as a missing value in the
DataFrame.
★ df: The resulting DataFrame after reading the CSV file. The missing values specified in
missing_value will be represented as NaN in the DataFrame.
Summary: This code reads a CSV file into a DataFrame, treating specific values as missing
values (NaN) during the import process.
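The points above can be sketched as follows. The CSV content and column names are made up for illustration, and io.StringIO stands in for a real file path so the example runs on its own.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = (
    "name,age,city\n"
    "Asha,23,Colombo\n"
    "N/A,na,Kandy\n"
    "Ravi,31,na\n"
)

# Values that should be treated as missing while reading the file.
missing_value = ["N/A", "na", np.nan]

df = pd.read_csv(io.StringIO(csv_text), na_values=missing_value)

print(df)               # the listed markers now appear as NaN
print(df.isna().sum())  # number of missing values per column
```

df.isna() marks every missing cell as True, so summing it per column gives a quick overview of where the gaps are.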
2. How to drop / delete rows that contain NaN values in every column?

★ df: Refers to the DataFrame on which the function is called.
★ dropna(): A function in Pandas used to drop rows with missing values.
★ how="all": Specifies the condition for dropping rows. Here, it means that a
row will only be dropped if all the values in that row are missing (NaN).
df.dropna(how="all") removes rows from a DataFrame that have missing
values in every column, so that only rows with at least one non-missing value
remain.
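A small sketch of this behaviour, using a made-up DataFrame. The name df_nulldropped for the result is an assumption taken from the interpolation step later in this session.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing in every column, row 0 only in one.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
    }
)

# Drop only the rows in which every value is NaN.
df_nulldropped = df.dropna(how="all")

print(df_nulldropped)
```

Row 0 is kept because it still holds one real value; only the fully empty row 1 is removed.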
3. How to replace a NaN value with a specific value?

★ df.fillna(0): This function fills missing values in the DataFrame df with the
value specified, which is zero (0) in this example. The fillna() method is a
Pandas function that replaces NaN values with the specified fill value.
★ df_fillwithzero: The resulting DataFrame after filling the missing values
with zero. Any NaN values in the original DataFrame df will be replaced
with 0.
Summary: This code replaces any missing values in the DataFrame with zero,
creating a new DataFrame df_fillwithzero that contains zero as a substitute
for the missing values.
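A minimal sketch of the zero fill, on a made-up DataFrame (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame(
    {
        "score": [10.0, np.nan, 30.0],
        "bonus": [np.nan, 5.0, np.nan],
    }
)

# fillna() returns a new DataFrame; df itself is left unchanged.
df_fillwithzero = df.fillna(0)

print(df_fillwithzero)
```

Zero is only a sensible substitute when it is a meaningful value for the column; for measurements such as age or temperature it can distort the data.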
4. How to replace a NaN value using forward filling?

★ df.fillna(method='ffill'): This call fills the missing values in the
DataFrame df with the values from the previous non-missing entry,
propagating the last observed value forward. The method parameter is
set to 'ffill' to specify the forward fill method. Recent Pandas versions
deprecate the method parameter in favor of the equivalent df.ffill().
★ df_forwardfilled: The resulting DataFrame after applying the forward fill
method to fill in the missing values. The missing values in df will be
replaced with the preceding non-missing values. A missing value with no
earlier non-missing value in its column stays NaN.
Summary: The code replaces missing values in the DataFrame df with the
most recent non-missing value observed before them, resulting in a new
DataFrame named df_forwardfilled.
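A short sketch of forward filling on made-up data. It uses df.ffill(), which gives the same result as fillna(method='ffill') and also works on Pandas versions where the method parameter is deprecated.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the first value and two middle values are missing.
df = pd.DataFrame({"temp": [np.nan, 21.0, np.nan, np.nan, 24.0]})

# Carry the last observed value forward into each gap.
df_forwardfilled = df.ffill()

print(df_forwardfilled)
```

The first row stays NaN because there is no earlier value to carry forward.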
5. How to replace a NaN value using backward filling?

★ df.fillna(method='bfill'): The fillna() function is used to fill missing values
in the DataFrame (df). The parameter method='bfill' specifies the
backward filling method, which means that missing values will be filled
with the next available value from the same column, effectively carrying
the next non-missing value backward. Recent Pandas versions deprecate
the method parameter in favor of the equivalent df.bfill().
★ df_backwardfilled: The resulting DataFrame after performing the
backward filling operation. Missing values in the original DataFrame (df)
have been replaced with the next available values from the same column.
Summary: This code fills the missing values in a DataFrame by replacing
them with the next available values from the same column, using backward
filling. The resulting DataFrame is stored in df_backwardfilled.
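A short sketch of backward filling on made-up data, using df.bfill() as the equivalent of fillna(method='bfill'):

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the first, third and last values are missing.
df = pd.DataFrame({"temp": [np.nan, 21.0, np.nan, 24.0, np.nan]})

# Pull the next observed value backward into each gap.
df_backwardfilled = df.bfill()

print(df_backwardfilled)
```

The last row stays NaN because there is no later value to pull back.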
6. How to replace a NaN value using the interpolate function?

★ df_nulldropped: Refers to the DataFrame from which missing values have
been previously dropped.
★ interpolate(): A function in Pandas that performs interpolation, which is a
technique used to estimate missing values based on the surrounding
data points. It computes intermediate values to fill in the gaps caused by
missing values. By default, it uses linear interpolation.
★ df_interpolate: The resulting DataFrame after applying interpolation on
df_nulldropped. It contains the original data with the missing values filled
in by the interpolated values.
Summary: The code uses interpolation to estimate and fill in missing values
in a DataFrame, resulting in a new DataFrame named df_interpolate.
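A minimal sketch of linear interpolation on made-up data. Here df_nulldropped is built directly for illustration; in the session it would be the result of the earlier dropna(how="all") step.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a two-value gap between 20.0 and 26.0.
df_nulldropped = pd.DataFrame({"temp": [20.0, np.nan, np.nan, 26.0]})

# Default method is linear: the gap is filled with evenly spaced values.
df_interpolate = df_nulldropped.interpolate()

print(df_interpolate)
```

Interpolation suits ordered numeric data such as time series; it is not meaningful for text or category columns.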
Click here : Sample dataset
Thanks
Do you have any questions?
