to Machine Learning
Practical Session - 03
Steps of the process
01 Import Data
02 Clean the Data
03 Split the data into training & testing sets
04 Design the model
05 Train the Model
06 Make Predictions
07 Evaluate and Improve
01
Import the dataset
Key Ways of Importing Datasets
1. Jupyter notebook
➢ Ensure that you have the dataset file available on your local machine.
➢ Use Python Pandas library to import the dataset into the notebook.
import pandas as pd
# Read CSV file into a DataFrame
dataframe = pd.read_csv('dataset.csv')
# Read Excel file into a DataFrame
dataframe = pd.read_excel('dataset.xlsx')
2. Kaggle
On Kaggle, many datasets are already hosted and can be imported into
your notebook in either of these ways:
1. Link a Kaggle dataset directly to a Kaggle notebook
2. Upload a dataset from your local computer to a Kaggle notebook
3. Google Colab
➢ In Google Colab, you can load a dataset directly from your Google
Drive.
➢ To do so, mount your Google Drive in the notebook and then read the
dataset file from the mounted folder.
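The mounting code itself is not reproduced here. A minimal sketch of the usual pattern follows, assuming the notebook runs inside Colab, where google.colab.drive.mount is available. The helper names mount_drive and drive_path, and the file dataset.csv placed at the top level of the Drive, are hypothetical and used only for illustration.

```python
# Conventional mount point for Google Drive inside a Colab notebook.
MOUNT_POINT = "/content/drive"


def mount_drive(mount_point=MOUNT_POINT):
    """Mount Google Drive when running inside Colab.

    Returns True if the mount was attempted, False when the
    google.colab package is not available (i.e. outside Colab).
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True


def drive_path(relative_path, mount_point=MOUNT_POINT):
    """Build the path of a file stored in the user's Drive (hypothetical helper)."""
    return f"{mount_point}/MyDrive/{relative_path}"


if mount_drive():
    import pandas as pd

    # 'dataset.csv' is an assumed file name at the top level of the Drive.
    dataframe = pd.read_csv(drive_path("dataset.csv"))
```

Outside Colab the import fails, mount_drive() returns False, and the read is skipped, so the same cell can be run safely in a local Jupyter notebook.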
02
Clean the data in the dataset
Importance of Data cleaning
● Data cleaning is the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data
within a dataset.
● Cleaning leaves accurate, defensible data in the dataset, which in turn
produces reliable visualizations, models, and business decisions.
Visit the link below to get the code:
https://github.com/SanduNihara/MLProject.git
Methods of Data cleaning
1. Identify all missing / NaN values in a dataset
★ missing_value: A list of the values to treat as missing, such as "N/A", "na", and np.nan
(the NaN value from the NumPy library).
★ df = pd.read_csv("..............", na_values=missing_value): This call reads the CSV file
at the given path into a DataFrame named df. The na_values parameter specifies which
values in the CSV file should be treated as missing. Here, any "N/A", "na", or NaN entry
in the file is recognized as a missing value in the DataFrame.
★ df: The resulting DataFrame after reading the CSV file. The values listed in
missing_value are represented as NaN in the DataFrame.
Summary: this code reads a CSV file into a DataFrame, treating the specified values as
missing values (NaN) during the import process.
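The description above can be sketched as follows. Because the file path is not given, this sketch reads a small in-memory CSV instead; the sample rows and the name csv_text are assumptions made only for illustration.

```python
import io

import numpy as np
import pandas as pd

# Values that should be treated as missing when the file is read.
missing_value = ["N/A", "na", np.nan]

# Hypothetical sample data standing in for the CSV file on disk.
csv_text = "name,age,city\nAnn,23,Colombo\nBen,na,N/A\n"

# na_values tells read_csv which entries to convert to NaN.
df = pd.read_csv(io.StringIO(csv_text), na_values=missing_value)

# Count the missing values found in each column.
print(df.isnull().sum())
```

With a real file, the io.StringIO(...) argument would simply be replaced by the path to the CSV.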
2. How to drop / delete rows that include NaN values in every column?
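The code for this step is not shown here. As a minimal sketch, assuming the intended Pandas method is dropna(): passing how="all" removes only the rows in which every column is NaN, while the default how="any" removes every row that contains at least one NaN. The sample data and the names df_drop_all and df_drop_any are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 is entirely NaN,
# row 2 is missing only the age.
df = pd.DataFrame({
    "age": [23, np.nan, np.nan],
    "city": ["Colombo", np.nan, "Kandy"],
})

# Drop only the rows where every column is NaN.
df_drop_all = df.dropna(how="all")

# Drop the rows that contain at least one NaN (the default behaviour).
df_drop_any = df.dropna()

print(df_drop_all)
print(df_drop_any)
```

Both calls return a new DataFrame and leave df unchanged.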
3. How to replace NaN values with zero?
★ df.fillna(0): This call fills the missing values in the DataFrame df with the
specified value, which is zero (0) in this example. fillna() is a Pandas
method that replaces NaN values with the given fill value.
★ df_fillwithzero: The resulting DataFrame after filling the missing values
with zero. Any NaN values in the original DataFrame df are replaced
with 0.
Summary: this code replaces every missing value in the DataFrame with zero,
creating a new DataFrame df_fillwithzero that contains 0 in place of the
missing values.
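As a minimal sketch of the step described above, using a small hypothetical DataFrame in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({"age": [23, np.nan], "score": [np.nan, 7.5]})

# fillna(0) returns a new DataFrame in which every NaN is replaced by 0;
# the original df is left unchanged.
df_fillwithzero = df.fillna(0)

print(df_fillwithzero)
```

Zero is only one possible fill value; whether it is appropriate depends on what the column represents.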
4. How to replace NaN values using the forward-fill function?