
What is a Data Frame?

• Used to represent data in the form of rows and columns
• Data can come from a file, an Excel spreadsheet, a Python sequence (lists and tuples) or a dictionary
• After storing data in the frame, various
operations can be done to analyze and
understand it.
• The ‘pandas’ package in Python is used for data analysis and manipulation.
• The name pandas is derived from ‘panel data’, a term for multidimensional data.
• Pandas has historically provided three data structures: Series, DataFrame and Panel. They are built on top of NumPy arrays and add row and column labels; they are not inherently faster than NumPy arrays.
• Series – a one-dimensional object. Homogeneous data. Size is immutable. Values are mutable.
10 23 56 17 52 61 73 90 26 72

• DataFrame – a two-dimensional object with heterogeneous data. Both the size and the data are mutable.
Name   Age  Gender  Rating
Steve  32   Male    3.45
Lia    28   Female  4.6
Vin    45   Male    3.9
Katie  38   Female  2.78

• Panel – a three-dimensional data structure with heterogeneous data. Panel has been removed from current pandas releases.
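The structures described above can be sketched as follows. This is a minimal illustration that reuses the sample values from the slides; the variable names marks and people and the extra column Senior are assumptions made for the example.

```python
import pandas as pd

# A Series: one-dimensional and labelled, with one dtype for all values.
marks = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])

# A DataFrame built from a dictionary of lists; each key becomes a column,
# and different columns may hold different types.
people = pd.DataFrame({
    "Name": ["Steve", "Lia", "Vin", "Katie"],
    "Age": [32, 28, 45, 38],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

# Values of a Series can be changed in place.
marks[0] = 11

# A column can be added to a DataFrame, so its size is mutable.
# "Senior" is a hypothetical column used only for this illustration.
people["Senior"] = people["Age"] > 35

print(marks)
print(people)
```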

Creating Data Frames from .csv files
• Create an Excel file and store the employee data in it (the examples that follow use the columns Empid, Ename, Salary and DOJ)
• Save the file as empdata.csv (CSV format)
• Type the following in Jupyter Notebook:
import pandas as pd
df = pd.read_csv(r"C:\Users\Admin\Desktop\PU I Sem 2019-2020\CSE 317 Prog in Python\Lecture Slides\empdata.csv")
df
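To try the loading step without a file on disk, the same call can be pointed at an in-memory text buffer. The rows below are hypothetical sample data, not the contents of the original empdata.csv; only the column names Empid, Ename, Salary and DOJ come from the slides.

```python
import io
import pandas as pd

# Hypothetical rows standing in for the contents of empdata.csv.
csv_text = """Empid,Ename,Salary,DOJ
1001,Ganesh,10000.0,2000-10-10
1002,Anil,23000.5,2002-03-03
1003,Gaurav,18000.33,2002-03-03
1004,Hema,26500.5,2000-09-10
1005,Laxmi,12000.75,2000-10-08
1006,Anant,9999.99,1999-09-09
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows
```

When reading from a real Windows path, a raw string (r"...") keeps backslashes such as \U from being treated as escape sequences.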
Operations on Data Frame
• To retrieve a range of rows
>>df[2:5]
>>df[::2]
• To retrieve column names
>>df.columns
• To retrieve column data
>>df.Empid
>>df["Empid"]
• To retrieve data from multiple columns
>>df[["Empid", "Ename"]]
• To find minimum and maximum values of a
column
>>df["Salary"].max()
>>df["Salary"].min()
• To display statistical information
>>df.describe()
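The operations listed above can be tried together on a small frame. This is a sketch with hypothetical employee rows; the result variable names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical employee data using the column names from the slides.
df = pd.DataFrame({
    "Empid": [1001, 1002, 1003, 1004, 1005, 1006],
    "Ename": ["Ganesh", "Anil", "Gaurav", "Hema", "Laxmi", "Anant"],
    "Salary": [10000.0, 23000.5, 18000.33, 26500.5, 12000.75, 9999.99],
})

rows_2_to_4 = df[2:5]             # rows at positions 2, 3 and 4 (5 is excluded)
every_other = df[::2]             # rows at positions 0, 2 and 4
names = list(df.columns)          # column names as a plain list
ids = df["Empid"]                 # one column name returns a Series
subset = df[["Empid", "Ename"]]   # a list of names returns a DataFrame
highest = df["Salary"].max()
lowest = df["Salary"].min()
summary = df.describe()           # count, mean, std, min, quartiles and max of numeric columns

print(rows_2_to_4)
print(summary)
```

Attribute access such as df.Empid only works when the column name is a valid Python identifier and does not clash with a DataFrame method, so the bracket form is the safer default.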
Queries on Data
• To display the details of the employees whose
salary is greater than 20000
>>df[df.Salary > 20000]
• To display only the Empid and Names of the
employees whose salary is greater than 20000
>>df[["Empid", "Ename"]][df.Salary > 20000]
• To get the details of the highest paid employee
>>df[df.Salary == df.Salary.max()]
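The queries above all rely on a boolean mask: a comparison on a column yields one True/False value per row, and indexing with that mask keeps the True rows. The sketch below uses hypothetical data and also shows df.loc, which selects rows and columns in one step as an alternative to chaining two sets of brackets.

```python
import pandas as pd

# Hypothetical employee data using the column names from the slides.
df = pd.DataFrame({
    "Empid": [1001, 1002, 1003, 1004, 1005, 1006],
    "Ename": ["Ganesh", "Anil", "Gaurav", "Hema", "Laxmi", "Anant"],
    "Salary": [10000.0, 23000.5, 18000.33, 26500.5, 12000.75, 9999.99],
})

mask = df["Salary"] > 20000       # boolean Series, one value per row
high_paid = df[mask]              # all columns for the matching rows

# Row filter and column selection in a single .loc call.
ids_names = df.loc[mask, ["Empid", "Ename"]]

# Row(s) whose salary equals the maximum salary.
top = df[df["Salary"] == df["Salary"].max()]

print(high_paid)
print(ids_names)
print(top)
```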
Sorting Data
• Read the DOJ column as a date type while loading the file
>>df = pd.read_csv("File Path", parse_dates=["DOJ"])
>>print(df)
• Sort in ascending order of DOJ and store in data
frame df1
>>df1 = df.sort_values("DOJ")
>>df1
• To sort in descending order of DOJ
>>df1 = df.sort_values("DOJ", ascending=False)
Sorting on Multiple Columns
• To sort on DOJ in descending order and, among rows with the same DOJ, on Salary in ascending order
>>df1 = df.sort_values(by=["DOJ", "Salary"], ascending=[False, True])
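The sorting steps above can be checked on a small in-memory CSV. The rows are hypothetical, and the dates are written in ISO form (YYYY-MM-DD) as an assumption so that day and month cannot be confused during parsing.

```python
import io
import pandas as pd

# Hypothetical rows; two employees share the same DOJ to show the tie-break.
csv_text = """Empid,Ename,Salary,DOJ
1001,Ganesh,10000.0,2000-10-10
1002,Anil,23000.5,2002-03-03
1003,Gaurav,18000.33,2002-03-03
1004,Hema,26500.5,2000-09-10
1005,Laxmi,12000.75,2000-10-08
1006,Anant,9999.99,1999-09-09
"""

# parse_dates converts the DOJ column to a datetime type while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["DOJ"])

oldest_first = df.sort_values("DOJ")
newest_first = df.sort_values("DOJ", ascending=False)

# DOJ descending; rows with equal DOJ are ordered by Salary ascending.
by_date_then_salary = df.sort_values(by=["DOJ", "Salary"],
                                     ascending=[False, True])

print(by_date_then_salary)
```

sort_values returns a new data frame and leaves df unchanged, which is why each result is stored under its own name.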
Filling in Missing Values – Data Cleansing
• Use fillna() to replace the NaN values by a
specified value
>>df1 = df.fillna(0)
• To fill missing values with a different value for each column
>>df1 = df.fillna({"Ename": "Name is Missing", "Salary": 0.0, "DOJ": "00-00-00"})
>>df1
• To drop those rows with missing values
>>df1 = df.dropna()
>>df1
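The cleansing steps above can be sketched on a frame with two deliberately missing entries. The data and the result variable names are hypothetical. A single scalar passed to fillna applies the same value to every column, so the example uses the per-column dictionary for the whole frame and a scalar only on the numeric Salary column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing salary.
df = pd.DataFrame({
    "Empid": [1001, 1002, 1003, 1004],
    "Ename": ["Ganesh", None, "Gaurav", "Hema"],
    "Salary": [10000.0, 23000.5, np.nan, 26500.5],
})

# Count of missing values per column before cleaning.
missing_per_column = df.isna().sum()

# Replace missing values with a column-specific default.
filled_by_col = df.fillna({"Ename": "Name is Missing", "Salary": 0.0})

# Replace missing values in one numeric column with a scalar.
salary_zero = df["Salary"].fillna(0)

# Drop every row that has at least one missing value.
complete_rows = df.dropna()

print(missing_per_column)
print(filled_by_col)
print(complete_rows)
```

Both fillna and dropna return new objects; df itself still contains the missing values unless the result is assigned back to it.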
