Section 8 - Data Preprocessing

Programming 2
Get the data
• From Kaggle:
https://www.kaggle.com/datasets/andrewmvd/udemy-courses
Step 1: Load the CSV file
import pandas as pd
# Read the downloaded file into a DataFrame (the file must be in the working directory)
df = pd.read_csv('udemy_courses.csv')
print(df)
Step 2: Check the data type of each column
import pandas as pd
df = pd.read_csv('udemy_courses.csv')
# dtypes lists every column together with its data type
print(df.dtypes)
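To see what dtypes reports without downloading the Kaggle file, here is a minimal sketch on a small hand-made DataFrame. The column names and values are made up for illustration and are not taken from udemy_courses.csv.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate dtypes
toy = pd.DataFrame({
    "course_title": ["Intro to Python", "Web Basics"],
    "price": [20, 0],
    "rating": [4.5, 3.9],
})

# dtypes returns a Series that maps each column name to its data type
types = toy.dtypes
print(types)
```

Whole numbers are stored with an integer type and decimals with a float type; text columns get a string-like type whose exact name depends on the pandas version.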
Step 3: Print summary statistics
import pandas as pd
df = pd.read_csv('udemy_courses.csv')
# describe() summarizes the numeric columns (count, mean, std, min, quartiles, max)
print(df.describe())
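As a small illustration of what describe() returns, the sketch below uses a made-up DataFrame with one numeric and one text column (both hypothetical). By default only the numeric column appears in the summary.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column
toy = pd.DataFrame({
    "price": [10, 20, 30, 40],
    "level": ["Beginner", "Expert", "Beginner", "Expert"],
})

# The result is itself a DataFrame indexed by statistic name
stats = toy.describe()
print(stats)

# Individual statistics can be looked up by row label and column name
print(stats.loc["mean", "price"])
```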
Step 4: Finding the missing values in each column
import pandas as pd
df = pd.read_csv('udemy_courses2.csv')
# isna() marks missing cells as True; sum() counts them per column
print(df.isna().sum())
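The same check can be tried on a small in-memory DataFrame. This is a sketch with made-up rows (not from the Udemy data) where two cells are deliberately left empty.

```python
import pandas as pd

# Hypothetical toy data with two missing cells (None)
toy = pd.DataFrame({
    "course_title": ["Course A", None, "Course C"],
    "price": [20.0, 35.0, None],
    "subject": ["Web", "Web", "Music"],
})

# One count of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```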
Step 5: Dropping the rows that contain missing values, then
checking the NA counts again after the deletion.
import pandas as pd
df = pd.read_csv('udemy_courses2.csv')
# dropna() removes every row that has at least one missing value
df = df.dropna()
print(df.isna().sum())
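It can be useful to see how many rows the deletion costs. The following sketch, again on made-up data, compares the row count before and after dropna().

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 2 each have one missing cell
toy = pd.DataFrame({
    "course_title": ["Course A", None, "Course C"],
    "price": [20.0, 35.0, None],
    "subject": ["Web", "Web", "Music"],
})

rows_before = len(toy)
cleaned = toy.dropna()   # returns a new DataFrame; toy itself is unchanged
rows_after = len(cleaned)

print(rows_before - rows_after, "rows dropped")
print(cleaned.isna().sum())
```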
Step 6: Checking for duplicates
import pandas as pd
df = pd.read_csv('udemy_courses2.csv')
# duplicated() returns True for each row that repeats an earlier row
print(df.duplicated())
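On a large file the True/False listing is hard to read, so a common follow-up is to count the True values. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first
toy = pd.DataFrame({
    "course_title": ["Course A", "Course B", "Course A"],
    "price": [10, 20, 10],
})

dup_mask = toy.duplicated()   # boolean Series, one entry per row
print(dup_mask)

# True counts as 1, so sum() gives the number of duplicate rows
print(dup_mask.sum())
```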
Step 7: Removing duplicates
import pandas as pd
df = pd.read_csv('udemy_courses2.csv')
# inplace=True modifies df directly instead of returning a new DataFrame
df.drop_duplicates(inplace=True)
print(df)
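As an alternative to inplace=True, drop_duplicates() can return a new DataFrame and leave the original untouched. The sketch below shows this on made-up rows; resetting the index afterwards is optional and shown only as one possible clean-up step.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first
toy = pd.DataFrame({
    "course_title": ["Course A", "Course B", "Course A"],
    "price": [10, 20, 10],
})

# Keeps the first occurrence of each row and drops later repeats
deduped = toy.drop_duplicates()

# Optional: renumber the rows 0..n-1 after the removal
deduped = deduped.reset_index(drop=True)
print(deduped)
```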
How to create a DataFrame?
A DataFrame is a 2-dimensional labeled data structure provided by the pandas library for Python.

It is similar to a table in a relational database or a spreadsheet in Excel.

import pandas as pd
data = {"name": ['ahmed', 'esraa', 'ali', 'asmaa'],
        "age": [21, 20, 19, 21]}
# Load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
How to create a DataFrame? (from a list of rows)
import pandas as pd
data = [['ahmed',21],['esraa',20],['ali',19],['asmaa',21]]
# load data into a DataFrame object:
df = pd.DataFrame(data,columns=['name','age'])
print(df)
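The dictionary form and the list-of-rows form describe the same table. A short sketch that builds both and compares them, using the same names and ages as above (the variable names from_dict and from_rows are chosen here for illustration):

```python
import pandas as pd

# Column-oriented: each key is a column name
from_dict = pd.DataFrame({
    "name": ["ahmed", "esraa", "ali", "asmaa"],
    "age": [21, 20, 19, 21],
})

# Row-oriented: each inner list is one row, column names given separately
from_rows = pd.DataFrame(
    [["ahmed", 21], ["esraa", 20], ["ali", 19], ["asmaa", 21]],
    columns=["name", "age"],
)

print(from_dict.shape)              # (number of rows, number of columns)
print(from_dict.equals(from_rows))  # True when both hold identical data
```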
How to print a specific number of rows from a CSV file?
import pandas as pd
df = pd.read_csv('udemy_courses.csv')
print(df.head(15))
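A brief sketch of how head() behaves, using a made-up ten-row DataFrame: the argument is the number of rows to return, it defaults to 5 when omitted, and tail() does the same from the end.

```python
import pandas as pd

# Hypothetical toy data: ten rows numbered 1 to 10
toy = pd.DataFrame({"n": range(1, 11)})

first_three = toy.head(3)    # first 3 rows
default_head = toy.head()    # no argument: first 5 rows
last_two = toy.tail(2)       # last 2 rows

print(first_three)
print(default_head)
print(last_two)
```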
How to print specific columns from a CSV file?

One column
import pandas as pd
data = pd.read_csv('udemy_courses.csv')
print(data['subject'])
How to print specific columns from a CSV file?
Multiple columns
import pandas as pd
df = pd.read_csv('udemy_courses.csv')
print(df[['level','course_title']])
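The difference between the two forms is the bracket style: a single column name returns a Series, while a list of names (double brackets) returns a DataFrame. A minimal sketch on made-up rows that reuse the column names from the examples above:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the examples
toy = pd.DataFrame({
    "course_title": ["Course A", "Course B"],
    "level": ["Beginner Level", "All Levels"],
    "subject": ["Web Development", "Business Finance"],
})

one = toy["subject"]                     # single name -> Series
many = toy[["level", "course_title"]]    # list of names -> DataFrame

print(type(one).__name__)
print(type(many).__name__)
```

The columns of the result follow the order given in the list, not their order in the file.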
