
Pandas

Pandas is an open-source Python library (module) providing high-performance data manipulation
and analysis tools. The name Pandas is derived from "panel data". Wes McKinney began developing
pandas in 2008.

Data Structures used in Pandas

a) Series: A one-dimensional labelled array capable of holding data of any type (integer,
string, float, Python objects, etc.). All values in a Series share one dtype; when the values
are of mixed types, they are stored with dtype object.

Creating a Series:

Syntax:- varname=pandas.Series(data, index, dtype)

Ex:
import pandas as pd
lst=[10,20,30,40]
s=pd.Series(lst)
print(s)

Output:
0    10
1    20
2    30
3    40
dtype: int64

Creating a Series object with a programmer-defined index:

Ex:
import pandas as pd
lst=[10,"Rossum",34.56]
s=pd.Series(lst,index=["Stno","Name","Marks"])
print(s)

Output:
Stno         10
Name     Rossum
Marks     34.56
dtype: object
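
A minimal sketch of how the optional index and dtype parameters from the syntax line can be combined, and how a label can then be used to read a value back. The labels and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three marks labelled by subject, stored as floats
marks = pd.Series([70, 85, 90],
                  index=["maths", "physics", "chemistry"],
                  dtype="float64")

print(marks)             # the values are floats because of dtype="float64"
print(marks["physics"])  # read a value by its label
print(marks.iloc[0])     # read a value by its position
```

Reading by label works because the index was supplied; without it, pandas assigns the positions 0, 1, 2 as labels.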

Creating a Series object from a dict:

Ex:
import pandas as pd
d1={"sub1":"Python","sub2":"Java","sub3":"Data Science","sub4":"ML"}
s=pd.Series(d1)   # the dict keys become the index labels
print(s)

Output:
sub1          Python
sub2            Java
sub3    Data Science
sub4              ML
dtype: object

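Attributes and methods on Series:

- An attribute returns information about the object.
- Reading an attribute does not modify or manipulate the object.

import pandas as pd
a=["sreenu","varshini",2,3,4,5]
b=pd.Series(a)
print(b)

Attribute   Code              Output
values      print(b.values)   ['sreenu' 'varshini' 2 3 4 5]
index       print(b.index)    RangeIndex(start=0, stop=6, step=1)
dtype       print(b.dtype)    object
shape       print(b.shape)    (6,)
size        print(b.size)     6
array       print(b.array)    <PandasArray>
                              ['sreenu', 'varshini', 2, 3, 4, 5]
                              Length: 6, dtype: object
ndim        print(b.ndim)     1

Note: items is a method, not an attribute. print(b.items) without parentheses prints only the
bound-method description (<bound method Series.items of 0 sreenu ... dtype: object>); call
b.items() to get the (index, value) pairs. In pandas 2.1 and later the array output is headed
<NumpyExtensionArray> instead of <PandasArray>.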
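
A short sketch of calling items() as a method, so that it yields the (index, value) pairs instead of printing the bound-method description. The shortened list is only for illustration.

```python
import pandas as pd

b = pd.Series(["sreenu", "varshini", 2])

# items() yields (index, value) pairs; list() collects them
pairs = list(b.items())
print(pairs)

# The same pairs can be consumed directly in a loop
for idx, value in b.items():
    print(idx, value)
```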

Methods:

- A method performs an operation on an object; it may return a result or modify the object.
- It represents the behaviour of an object.

import pandas as pd
b=[1,2,3,4,5]
a=pd.Series(b)
print(a)

Method     Description                 Code                  Output
sum        Adds all the values         print(a.sum())        15
product    Multiplies all the values   print(a.product())    120
mean       Average of the values       print(a.mean())       3.0
median     Middle value                print(a.median())     3.0
count      Number of non-null values   print(a.count())      5
describe   Summary statistics          print(a.describe())   (shown below)

Output of print(a.describe()):
count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
dtype: float64
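
The methods listed here return a result and leave the Series itself unchanged. A small sketch, using made-up values, that checks this for sum() and sort_values():

```python
import pandas as pd

a = pd.Series([3, 1, 2])

total = a.sum()            # returns a single number
ordered = a.sort_values()  # returns a new, sorted Series

print(total)
print(list(ordered))
print(list(a))             # the original order is still 3, 1, 2
```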

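Parameters and arguments:

import pandas as pd
a=["apple","mango","grape"]
b=["a","b","c"]
print(pd.Series())      # no arguments: an empty Series
print(pd.Series(a,b))   # positional arguments: data first, then index

Output of print(pd.Series(a,b)):
a    apple
b    mango
c    grape
dtype: object

import pandas as pd
a=["apple","mango","grape"]
b=["a","b","c"]
print(pd.Series(data=b,index=a))   # keyword arguments: the order no longer matters

Output:
apple    a
mango    b
grape    c
dtype: object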

b) DataFrame: A DataFrame is a two-dimensional data structure, like a two-dimensional array or
a table with rows and columns.

import pandas as pd
d1={"sub1":"Python","sub2":"Java"}
s=pd.Series(d1)
a=pd.DataFrame(s)
print(a)

Output:
           0
sub1  Python
sub2    Java

Approaches to creating a DataFrame

Creating a DataFrame object by using a list / tuple:

import pandas as pd
lst=[10,20,30]
s=pd.Series(lst)
a=pd.DataFrame(s)
print(a)

Output:
    0
0  10
1  20
2  30

import pandas as pd
lst=[[10,20,30,40],["RS","JS","MCK","TRV"]]
df=pd.DataFrame(lst)   # each inner list becomes one row
print(df)

Output:
    0   1    2    3
0  10  20   30   40
1  RS  JS  MCK  TRV

import pandas as pd
lst=[[10,'RS'],[20,'JG'],[30,'MCK'],[40,'TRA']]
df=pd.DataFrame(lst)
print(df)

Output:
    0    1
0  10   RS
1  20   JG
2  30  MCK
3  40  TRA

Creating a DataFrame object by using a dict object:

Ex:
import pandas as pd
dictdata={"Names":["Rossum","Gosling","McKinney"],
          "Subjects":["Java","C","Pandas"],
          "Ages":[80,85,55]}
df=pd.DataFrame(dictdata)   # the dict keys become the column names
print(df)

Output:
      Names  Subjects  Ages
0    Rossum      Java    80
1   Gosling         C    85
2  McKinney    Pandas    55

Creating a DataFrame object by using a Series object:

Ex:
import pandas as pd
sdata=pd.Series([10,20,30,40])
df=pd.DataFrame(sdata)
print(df)

Output:
    0
0  10
1  20
2  30
3  40

Creating a DataFrame object by using an ndarray object:

Ex:
import numpy as np
import pandas as pd
l1=[[10,60],[20,70],[40,50]]
a=np.array(l1)
df=pd.DataFrame(a)
print(df)

Output:
    0   1
0  10  60
1  20  70
2  40  50

Misc Operations on DataFrame:

Ex:
import pandas as pd
data={"First":[10,20,30,40],"Second":[1.4,1.3,1.5,2.5]}
df=pd.DataFrame(data)
print(df)

Output:
   First  Second
0     10     1.4
1     20     1.3
2     30     1.5
3     40     2.5

import pandas as pd
data={"First":[10,20,30,40],"Second":[1.4,1.3,1.5,2.5]}
df=pd.DataFrame(data)
df["Third"]=df["First"]+df["Second"]   # new column computed row by row
print(df)

Output:
   First  Second  Third
0     10     1.4   11.4
1     20     1.3   21.3
2     30     1.5   31.5
3     40     2.5   42.5

Locate Row

As you can see from the result above, the DataFrame is like a table with rows and columns. Pandas
uses the loc attribute to return one or more specified row(s).

import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df.loc[0])   # row with index label 0, returned as a Series

Output:
calories    420
duration     50
Name: 0, dtype: int64
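
To return more than one row, loc can be given a list of index labels. A brief sketch on the same data: a single label gives a Series, while a list of labels gives a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390],
                   "duration": [50, 40, 45]})

row = df.loc[0]        # one label -> Series
rows = df.loc[[0, 1]]  # list of labels -> DataFrame

print(type(row).__name__)
print(rows)
```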

By using a CSV file (Comma-Separated Values): A CSV file is saved with the extension .csv and can
be opened in spreadsheet programs such as Excel. CSV files store tabular data (numbers and text)
in plain text.

Ex:
import pandas as pd
# raw string (r"...") so that the backslashes in the Windows path are kept as they are
df=pd.read_csv(r"C:\king\Book1.csv")
print(df)
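
The example above depends on a file that exists only on the author's machine. As a self-contained sketch, the same read_csv call can be tried on CSV text held in memory; the column names and values below are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content: first line is the header, then one row per line
csv_text = "name,score\na,90\nb,40\nc,80\n"

# read_csv accepts a file-like object as well as a path
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # score is read as a number, name as text
```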

Cleaning Empty Cells

Empty Cells: Empty cells can give you a wrong result when you analyze data.
Remove Rows: One way to deal with empty cells is to remove the rows that contain them. This is
usually acceptable when the data set is large, because removing a few rows will not have a big
impact on the result.

import pandas as pd
df = pd.read_csv(r'c:\king\data.csv')   # raw string, and no leading space in the path
df.dropna(inplace = True)               # drop every row that has at least one empty cell
print(df.to_string())
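
A self-contained sketch of the same idea, using made-up CSV text with one empty cell in place of the file on disk. Without inplace=True, dropna() returns a cleaned copy and leaves the original untouched.

```python
import io
import pandas as pd

# Hypothetical data: the score for "b" is missing
csv_text = "name,score\na,90\nb,\nc,80\n"
df = pd.read_csv(io.StringIO(csv_text))

cleaned = df.dropna()  # new DataFrame without the incomplete row

print(len(df))       # rows before cleaning
print(len(cleaned))  # rows after cleaning
print(cleaned)
```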

Find number of rows

import pandas as pd
df=pd.read_csv(r"C:\king\Book1.csv")
print(len(df))   # number of rows in the DataFrame

View top rows

import pandas as pd
df=pd.read_csv(r"C:\king\Book1.csv")
b=df.head()       # returns the top 5 rows by default
print(b)

import pandas as pd
df=pd.read_csv(r"C:\king\Book1.csv")
b=df.head(n=10)   # returns the top n rows (here 10)
print(b)

View bottom rows

import pandas as pd
df=pd.read_csv(r"C:\king\Book1.csv")
b=df.tail()       # returns the bottom 5 rows by default
print(b)

import pandas as pd
df=pd.read_csv(r"C:\king\Book1.csv")
b=df.tail(n=10)   # returns the bottom n rows (here 10)
print(b)

Export DataFrame to CSV file

import pandas as pd
name=["a","b","c"]
scr=[90,40,80]
b={"name":name,"score":scr}
df=pd.DataFrame(b)
c=df.to_csv(r"C:\king\Book2.csv")   # we can give the path where the file is saved
print(c)   # prints None: to_csv returns None when a path is given
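
When no path is given, to_csv returns the CSV text as a string, which is a convenient way to inspect what would be written. A small sketch on the same data; index=False leaves out the row labels.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [90, 40, 80]})

# No path: the CSV content comes back as a string instead of being saved
text = df.to_csv(index=False)
lines = text.splitlines()

print(lines)
```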

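Inplace parameter using sort

import pandas as pd
# squeeze("columns") turns the one-column DataFrame into a Series
# (the squeeze=True argument of read_csv was removed in pandas 2.0)
df=pd.read_csv("C:/king/Book1.csv", usecols=["name"]).squeeze("columns")
b=df.sort_values()   # returns a new sorted Series; df itself is unchanged
# pass inplace=True to sort the object itself instead of returning a copy
print(b)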
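
A self-contained sketch of what the inplace parameter changes, using a small made-up DataFrame in place of the CSV file.

```python
import pandas as pd

df = pd.DataFrame({"name": ["c", "a", "b"]})

# Default: a new sorted DataFrame is returned, df keeps its order
sorted_copy = df.sort_values("name")
order_before = list(df["name"])

# inplace=True: df itself is sorted and the call returns None
result = df.sort_values("name", inplace=True)
order_after = list(df["name"])

print(order_before, order_after, result)
```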

Information of file

import pandas as pd
df=pd.read_csv("C:/king/Book1.csv")
df.info()   # prints the summary itself and returns None, so print() is not needed

Cleaning data

Accessing the Data of DataFrame

=======================================================

1) DataFrameobj.head(no.of rows)

2) DataFrameobj.tail(no.of rows)

3) DataFrameobj.describe()

4) DataFrameobj.shape

5) DataFrameobj[start:stop:step]

6) DataFrameobj["Col Name"]

7) DataFrameobj[ ["Col Name1","Col Name-2"...."Col Name-n"] ]

8) DataFrameobj[ ["Col Name1","Col Name-2"...."Col Name-n"]] [start:stop:step]

9) DataFrameobj.iterrows()

===================================================
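
The access patterns listed above can be tried on a small DataFrame. This is only a sketch; the column names and marks are made up.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "maths": [90, 40, 80, 65],
                   "science": [70, 55, 85, 60]})

print(df.head(2))        # first 2 rows
print(df.tail(2))        # last 2 rows
print(df.describe())     # summary statistics of the numeric columns
print(df.shape)          # (rows, columns)

every_second = df[0:4:2]           # rows 0 and 2 (stop is excluded)
one_col = df["maths"]              # a single column, as a Series
two_cols = df[["name", "maths"]]   # several columns, as a DataFrame
part = df[["name", "maths"]][1:3]  # columns first, then a row slice

# iterrows() yields (index, row) pairs, one row at a time
for idx, row in df.iterrows():
    print(idx, row["name"], row["maths"])
```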

Understanding loc[] ----- label-based: here both the start and stop index are included, and

column names are used (not column numbers)

--------------------------------------------------------------------------------------

1) DataFrameobj.loc[row_number]

2) DataFrameobj.loc[row_number,["Col Name1",.........] ]

3) DataFrameobj.loc[start:stop:step]

4) DataFrameobj.loc[start:stop:step,["Col Name"] ]

5) DataFrameobj.loc[start:stop:step,["Col Name1","Col Name-2",......."Col Name-n"] ]

6) DataFrameobj.loc[start:stop:step,"Col Name1" : "Col Name-n"]

------------------------------------------------------------------------------------------------------------
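
A brief sketch of the loc forms above on a made-up DataFrame with the default index 0..3. Note that the slice 0:2 returns three rows, because the stop label is included.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "maths": [90, 40, 80, 65],
                   "science": [70, 55, 85, 60]})

one_row = df.loc[1]                          # row labelled 1
some_cols = df.loc[1, ["name", "maths"]]     # row 1, two named columns
rows_0_to_2 = df.loc[0:2]                    # labels 0, 1 and 2
stepped = df.loc[0:3:2, ["maths"]]           # labels 0 and 2, one column
col_range = df.loc[0:2, "name":"maths"]      # column slice, end included

print(rows_0_to_2)
print(col_range)
```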

Understanding iloc[] ----- position-based: here the start index is included and the stop index is
excluded, and column numbers must be used (not column names)

--------------------------------------------------------------------------------------

1) DataFrameobj.iloc[row_number]

2) DataFrameobj.iloc[row_number,Col Number]

3) DataFrameobj.iloc[row_number,[Col Number1,Col Number2............] ]

4) DataFrameobj.iloc[row start:row stop, Col Start: Col stop]

5) DataFrameobj.iloc[row start:row stop,Col Number ]

6) DataFrameobj.iloc[ [row number1, row number-2.....] ]

7) DataFrameobj.iloc[ row start: row stop , [Col Number1,Col Number2............] ]

8) DataFrameobj.iloc[ : , [Col Number1,Col Number2............] ]
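
A matching sketch for iloc on a made-up DataFrame. Here the slice 0:2 returns two rows, because the stop position is excluded, and columns are addressed by number.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "maths": [90, 40, 80, 65],
                   "science": [70, 55, 85, 60]})

one_row = df.iloc[1]             # second row
one_cell = df.iloc[1, 1]         # second row, second column
block = df.iloc[0:2, 0:2]        # rows 0-1, columns 0-1
picked_rows = df.iloc[[0, 3]]    # first and last row
picked_cols = df.iloc[:, [0, 2]] # all rows, columns 0 and 2

print(block)
print(picked_cols)
```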

=======================================================================

Adding Column Name to Data Frame

=======================================================================

1) dataframeobj['new col name']=default value

2) dataframeobj['new col name']=expression
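
Both forms can be sketched as follows; the column names "school" and "total" are hypothetical. A single default value is repeated for every row, while an expression is evaluated row by row.

```python
import pandas as pd

df = pd.DataFrame({"maths": [90, 40], "science": [70, 55]})

df["school"] = "ABC"                        # same default value in every row
df["total"] = df["maths"] + df["science"]   # expression computed per row

print(df)
```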

=======================================================================

Removing Column Name from Data Frame

=======================================================================

1)dataframe.drop(columns="col name")

2)dataframe.drop(columns="col name",inplace=True)
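
A short sketch of the difference between the two forms, on made-up data: without inplace, drop returns a new DataFrame and the original keeps its columns.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"],
                   "maths": [90, 40],
                   "science": [70, 55]})

# Form 1: returns a copy without the column; df is unchanged
smaller = df.drop(columns="science")
cols_after_copy = list(df.columns)

# Form 2: removes the column from df itself
df.drop(columns="maths", inplace=True)
cols_after_inplace = list(df.columns)

print(list(smaller.columns), cols_after_copy, cols_after_inplace)
```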

=======================================================================

sorting the dataframe data

=======================================================================

1) dataframeobj.sort_values("colname")

2) dataframeobj.sort_values("colname",ascending=False)

3) dataframeobj.sort_values(["colname1","col name2",...col name-n] )
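
The three sorting forms, sketched on made-up marks. With a list of columns, rows that tie on the first column are ordered by the second.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "maths": [80, 40, 80, 65],
                   "science": [70, 55, 60, 90]})

ascending = df.sort_values("maths")                    # smallest first
descending = df.sort_values("maths", ascending=False)  # largest first
by_both = df.sort_values(["maths", "science"])         # ties broken by science

print(by_both)
```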


=======================================================================

knowing duplicates in dataframe data

=======================================================================

1) dataframeobj.duplicated()---------------gives boolean result

=======================================================================

Removing duplicates from dataframe data

=======================================================================

1) dataframeobj.drop_duplicates()

2) dataframeobj.drop_duplicates(inplace=True)
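
A small sketch covering both duplicated() and drop_duplicates() on made-up data in which the third row repeats the first. By default the first occurrence is kept and later repeats are flagged.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "a"],
                   "maths": [90, 40, 90]})

flags = df.duplicated()        # True for rows that repeat an earlier row
unique = df.drop_duplicates()  # copy without the repeated rows

print(list(flags))
print(unique)

df.drop_duplicates(inplace=True)  # same removal applied to df itself
print(len(df))
```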

=======================================================================

Data Filtering and Conditional Change

=======================================================================

1) dataframeobj.loc[ simple condition]

Ex: df.loc[ df["maths"]>75 ]

2) dataframeobj.loc[ compound condition]

Ex: df.loc[ (df["maths"]>60) & (df["maths"]<85) ]

Ex: df.loc[ (df["percent"]>=60) & (df["percent"]<=80),["grade"]]="First" # conditional update

Special Case:

3) dataframeobj.loc[ dataframeobj["col name"].str.contains(str) ]

4) dataframeobj.loc[ dataframeobj["col name"].str.startswith(str) ]

5) dataframeobj.loc[ dataframeobj["col name"].str.endswith(str) ]
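
The filtering forms above, sketched on a made-up marks table. The "grade" column is created first with a placeholder value so that the conditional update has a column to write into; that placeholder is an assumption of this sketch, not part of the notes.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Rossum", "Gosling", "McKinney", "Ritchie"],
                   "maths": [90, 70, 80, 55],
                   "percent": [88.0, 72.0, 65.0, 50.0]})

high = df.loc[df["maths"] > 75]                         # simple condition
middle = df.loc[(df["maths"] > 60) & (df["maths"] < 85)]  # compound condition

# Conditional update: set grade only where the condition holds
df["grade"] = "NA"
df.loc[(df["percent"] >= 60) & (df["percent"] <= 80), ["grade"]] = "First"

# String conditions are built from a text column via .str
has_s = df.loc[df["name"].str.contains("s")]
starts_r = df.loc[df["name"].str.startswith("R")]
ends_ey = df.loc[df["name"].str.endswith("ey")]

print(df)
```

Each condition produces a boolean Series, and loc keeps the rows where it is True; the parentheses around each part of a compound condition are required because & binds more tightly than the comparisons.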
