You are on page 1of 27

CHAPTER – 5 – DATA HANDLING USING PANDAS

Introduction - Pandas is an open-source Python Library providing high-performance data manipulation


and analysis tool using its powerful data structures. The name Pandas is derived from the word Panel Data.
In 2008, developer Wes McKinney started developing pandas when in need of high performance, flexible
tool for analysis of data.
Key Features of Pandas:
 Dataframe object help a lot in keeping track of our data.
 With a pandas dataframe, we can have different data types (float, int, string, datetime, etc) all in one
place.
 Pandas has built in functionality for like easy grouping & easy joins of data.
 Good IO capabilities; easily pull data from a MySQL database directly into a data frame.
 Tools for loading data into in-memory data objects from different file formats.
 Data alignment and integrated handling of missing data.
 Reshaping and pivoting of data sets.
 Label-based slicing, indexing and subsetting of large data sets.
Introduction to Pandas Data Structures:
Series < DataFrame < Panel
The best way to think of these data structures is that the higher dimensional data structure is a container of
its lower dimensional data structure. For example, DataFrame is a container of Series, Panel is a container
of DataFrame.

Data Structure Dimensions Description

Series 1 1D labelled homogeneous array, size immutable.

2D labelled, size-mutable tabular structure with potentially


Data Frames 2
heterogeneously typed columns.

Panel 3 3D labelled, size-mutable array.

Mutability - All Pandas data structures are value mutable (can be changed).
Note − DataFrame is widely used and one of the most important data structures. Panel is used much less.
Series - Series is a one-dimensional array like structure with homogeneous data. For example, the following
series is a collection of integers 10, 23, 56 …
10 23 56 17 52 61 73 90 26 72

Key Points:
 Homogeneous data
 Size Immutable
 Values of Data Mutable
A pandas Series can be created using the following constructor –
pandas.Series(data, index, dtype, copy)

Parameter & Description

data
data takes various forms like ndarray, list, constants

index
Index values must be unique and hashable, same length as data. Default np.arrange(n) if no index is
passed.

dtype
dtype is for data type. If None, data type will be inferred

copy
Copy data. Default False

Creating Pandas Series – A series can be created using various inputs like −
Array, Dict, Scalar value or constant, Mathematical Operations etc.
a) Create an Empty Series: An empty series can be created using Series() function of pandas library.
Example - #import the pandas library as pd
import pandas as pd
s = pd.Series()
print s
Output - Series([], dtype: float64)
b) Creating a series from Lists: In order to create a series from list, we have to first create a list after that
we can create a series from list.
Example 1 - # import the pandas library as pd
import pandas as pd
simple_list = ['g', 'e', 'e', 'k', 's']
s = pd.Series(simple_list)
print s
Output -

c) Create a Series from ndarray - If data is an ndarray, then index passed must be of the same length. If
no index is passed, then by default index will be range(n) where n is array length, i.e., [0,1,2,3….
range(len(array))-1].
Example 1 - #import the NUMPY & PANDAS as np & pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print s
Output –

We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-
1, i.e., 0 to 3.
Example 2 - #import the NUMPY & PANDAS as np & pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print s
Output -
We passed the index values here. Now we can see the customized indexed values in the
output.
d) Create a Series from dictionaries - A dict can be passed as input and if no index is specified, then the
dictionary keys are taken in a sorted order to construct index. If index is passed, the values in data
corresponding to the labels in the index will be pulled out.
Example 1 - #import the NUMPY & PANDAS as np & pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print s
Output -

Observe − Dictionary keys are used to construct index.


Example 2 - #import the NUMPY & PANDAS as np & pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print s
Output -

Observe − Index order is persisted and the missing element is filled with NaN (Not a Number).
e) Create a Series from Scalar (Single Item) - If data is a scalar value, an index must be provided. The
value will be repeated to match the length of index
Example - #import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print s
Output -

f) Create a Series from Mathematical Operations – Different types of mathematical operators (+, -, *, /
etc.) can be applied on pandas series to generate another series.
Example - # import the pandas library as pd
import pandas as pd1
s = pd1.Series([1,2,3])
t = pd1.Series([1,2,4])
u = s + t #addition operation
print(u)
u = s * t # multiplication operation
print(u)
Output -
Series Basic Functionality:

Attribute or Method & Description

axes
Returns a list of the row axis labels

dtype
Returns the dtype of the object.

empty
Returns True if series is empty.

ndim
Returns the number of dimensions of the underlying data, by definition 1.

size
Returns the number of elements in the underlying data.

values
Returns the Series as ndarray.

head()
Returns the first n rows. Maximum default value of n is 5.

tail()
Returns the last n rows. Maximum default value of n is 5.

Example - import pandas as pd


import numpy as np
#Create a series with range 10 to 60 with step 10
s = pd.Series(np.arange(10, 60, 10))
print s
print ("The axes are:" + s.axes)
print ("Is the Object empty?" + s.empty)
print ("The dimensions of the object:" + s.ndim)
print ("The size of the object:" + s.size)
print ("The actual data series is:" + s.values)
print ("The first two rows of the data series:" +
s.head(2))
print ("The last two rows of the data series:" +
s.tail(2))
Its output is as follows −

Accessing Data from Series with Position - Data in the series can be accessed similar to that in an
ndarray.
Example 1 – Retrieve the first element. As we already know, the counting starts from zero for the array,
which means the first element is stored at zeroth position and so on.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve the first element
print s[0]

Output –

Example 2 – Retrieve the first three elements in the Series. If a : is inserted in front of it, all items from that
index onwards will be extracted. If two parameters (with : between them) is used, items between the two
indexes (not including the stop index).
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve the first three element
print s[:3]
utput –

Example 3 – Retrieve the last three elements.


import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve the last three element
print s[-3:]
Output –

Retrieve Data Using Label (Index) – A Series is like a fixed-size dict in that you can get and set values by
index label.
Example 1 – Retrieve a single element using index label value.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve a single element
print s['a']
Output –

Example 2 – Retrieve multiple elements using a list of index label values.


import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve multiple elements
print s[['a','c','d']]
Output –
Example 3 – If a label is not contained, an exception is raised.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve multiple elements
print s['f']
Output –

DataFrame - A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns.
Structure –

You can think of it as an SQL table or a spreadsheet data representation.


Key Points:
 Homogeneous data
 Size Immutable
 Values of Data Mutable
pandas.DataFrame - A pandas DataFrame can be created using the following constructor –
pandas.DataFrame( data, index, columns, dtype, copy)
Parameter & Description

data
data takes various forms like ndarray, series, map, lists, dict, constants and also another
DataFrame.

index
For the row labels, the Index to be used for the resulting frame is Optional Default np.arange(n) if
no index is passed.

columns
For column labels, the optional default syntax is - np.arange(n). This is only true if no index is
passed.

dtype
Data type of each column.

copy
This command (or whatever it is) is used for copying of data, if the default is False.

Creating DataFrames – A pandas DataFrame can be created using various inputs like –
Lists, dict, Series, Numpy ndarrays, Another DataFrame
a) Create an Empty DataFrame - A basic DataFrame, which can be created is an Empty Dataframe.
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print df
Output –

b) Create a DataFrame from Lists - The DataFrame can be created using a single list or a list of lists.
Example 1 –
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print df
Output –
Example 2 –
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print df
Output -

Example 3 –
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print df
Output -

Observe, the dtype parameter changes the type of Age column to floating point.
c) Create a DataFrame from Dict of ndarrays / Lists - All the ndarrays must be of same length. If index
is passed, then the length of the index should equal to the length of the arrays.
If no index is passed, then by default, index will be range(n), where n is the array length.
Example 1 –
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve',
'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print df
Output –

Observe the values 0,1,2,3. They are the default index assigned to each using the function range(n).
Example 2 – Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve',
'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print df
Output –

Observe, the index parameter assigns an index to each row.


d) Create a DataFrame from List of Dicts – List of Dictionaries can be passed as input data to create a
DataFrame. The dictionary keys are by default taken as column names.
Example 1 –
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print df
Output –

Example 2 – The following example shows how to create a DataFrame by passing a list of dictionaries
and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print df
Output –

Example 3 – The following example shows how to create a DataFrame with a list of dictionaries, row
indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

#With two column indices, values same as dictionary keys


df1 = pd.DataFrame(data, index=['first', 'second'],
columns=['a', 'b'])

#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'],
columns=['a', 'b1'])
print df1
print df2
Output –

Observe, df2 DataFrame is created with a column index other than the dictionary key; thus, appended the
NaN’s in place. Whereas, df1 is created with column indices same as dictionary keys, so NaN’s appended.
e) Create a DataFrame from Dictionary of Series – Dictionary of Series can be passed to form a
DataFrame. The resultant index is the union of all the series indexes passed.
Example –
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),


'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print df
Output –

f) Create DataFrame from CSV files –


What is CSV (Comma Separated Value) file?
A CSV file is nothing more than a simple text file. However, it is the most common, simple,
and easiest method to store tabular data. This particular format arranges tables by following a
specific structure divided into rows and columns. It is these rows and columns that contain your data.
A new line terminates each row to start the next row. Similarly, a comma, also known as the delimiter,
separates columns within each row.
Take the following table as an example:

Now, the above table will look as follows if we represent it in CSV format:
Observe, a comma separates all the values in columns within each row. However, you can use other
symbols such as a semicolon (;) as a separator as well.
The two workhorse functions for reading text files (or the flat files) are read_csv() and
read_table(). They both use the same parsing code to intelligently convert tabular data into a
DataFrame object –
read_csv – This function supports reading of CSV files and save into a DataFrame object. Below
are the examples demonstrating the different parameters of read_csv function
Here is temp.csv file data we will use in all examples further –
S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
Example – Reading temp.csv file using Pandas
import pandas as pd
df=pd.read_csv("temp.csv")
print df
Output -

 sep - If the separator between each field of your data is not a comma, use the sep argument.
Example – Consider the data is separated in the above given temp.csv file by dollar ($) symbol
instead of comma (,) as below:
S.No$Name$Age$City$Salary
1$Tom$28$Toronto$20000
2$Lee$32$HongKong$3000
3$Steven$43$Bay Area$8300
4$Ram$38$Hyderabad$3900
Program -
import pandas as pd
data=pd.read_csv("temp.csv", sep='$')
print(data)
Output –

 delimiter - The delimiter argument of pandas read_csv function is same as sep.


 header - to specify which line in your data is to be considered as header (heading for each
column).
Program -
import pandas as pd
data=pd.read_csv("temp.csv", header='2')
print(data)
Output –

Note - If your csv file does not have header, then you need to set header = None while
reading it.
 names - Use the names attribute if you would want to specify column names to the
dataframe explicitly.
Program –
import pandas as pd
col_names = ['Serial No', 'Person Name', 'Age', 'Lives
in', 'Earns']
data = pd.read_csv("temp.csv", names = col_names,
header = None)
print(data)
Output –
 index_col - Use this argument to specify the row labels to use. Column name or number
both can be mentioned as index column.
Program –
import pandas as pd
df=pd.read_csv("temp.csv",index_col=['S.No'])
print df
Output –

 pandas read_csv usecols – To load specific columns into dataframe.


Program –
import pandas as pd
df=pd.read_csv("temp.csv", usecols = [0, 3])
print df
Output –

 prefix - When a data set doesn’t have any header , and you try to convert it to dataframe by
(header = None), pandas read_csv generates dataframe column names automatically with
integer values 0,1,2,…
If we want to prefix each column’s name with a string, say, “COLUMN”, such that dataframe
column names will now become COLUMN0, COLUMN1, COLUMN2 etc. we use prefix
argument.
 pandas read_csv dtype - Use dtype to set the datatype for the data or dataframe columns.
Program –
import pandas as pd
df = pd.read_csv("temp.csv", dtype={'Salary':
np.float64})
print df.dtypes
Output –

 true_values , false_values – Suppose your dataset contains Yes and No string which you
want to interpret as True and False.
 skiprows - skiprows skips the number of rows specified.
import pandas as pd
df=pd.read_csv("temp.csv", skiprows=2)
print df
Output -

 skipfooter - Indicates number of rows to skip from bottom of the file.


 nrows - If you want to read a limited number of rows, instead of all the rows in a dataset, use
nrows.
Program –
import pandas as pd
df=pd.read_csv("temp.csv", nrows=2)
print df

 skip_blank_lines - If skip_blank_lines option is set to False, then wherever blank lines are
present, NaN values will be inserted into the dataframe. If set to True, then the entire blank
line will be skipped.
 parse_dates - We can use pandas parse_dates to parse columns as datetime. You can
either use parse_dates = True or parse_dates = [‘column name’]
 iterator - Using the iterator parameter, you can define how much data you want to read , in
each iteration by setting iterator to True.
Program –
import pandas as pd
df=pd.read_csv("temp.csv", iterator = True)
print("first", data.get_chunk(2))
print("second", data.get_chunk(2))
Output –

g) Create DataFrame from Text files – Pandas built-in function pandas.read_table() method is
used to read Text files and store in DataFrame object.
Difference between CSV file and Text file:
TEXT FILE DATA – Separated by Tab (\t) CSV FILE DATA – Separated by Comma (,)
S.No Name Age City Salary S.No,Name,Age,City,Salary
1 Tom 28 Toronto 20000 1,Tom,28,Toronto,20000
2 Lee 32 HongKong 3000 2,Lee,32,HongKong,3000
3 Steven 43 Bay Area 8300 3,Steven,43,Bay Area,8300
4 Ram 38 Hyderabad 3900 4,Ram,38,Hyderabad,3900
read_table() to read the data read_csv() to read the data

Important Note – read_table() function supports all parameters which are supported
by read_csv() function.
Operations on DataFrame columns –
a) Column Selection – To select a column from DataFrame, we can simply pass column label into the
DataFrame object as given below:
Example –
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print df ['one']

Output –
b) Column Addition – To add a column in existed DataFrame, we can simply pass column label and
elements into the DataFrame object as given below:
Example -
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),


'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with


column label by passing new series

print ("Adding a new column by passing as Series:")


df['three']=pd.Series([10,20,30],index=['a','b','c'])
print df

print ("Adding a new column using the existing columns in


DataFrame:")
df['four']=df['one']+df['three']

print df

Output –

c) Column Deletion –
Example –
# Using the previous DataFrame, we will delete a column
# using del function
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print df

# using del function


print ("Deleting the first column using DEL function:")
del df['one']
print df

# using pop function


print ("Deleting another column using POP function:")
df.pop('two')
print df

Output –

Operations on DataFrame rows –


a) Row Selection – Rows can be selected by passing row label to a loc function.
Example –
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print df.loc['b']

Output –

Selection by integer location – Rows can be selected by passing integer location to an iloc function.
Example –
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),


'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print df.iloc[2]

Output –

Selection by slicing – Multiple rows can be selected using ‘ : ’ operator.


Example –
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),


'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print df[2:4]

Output –

b) Row Addition – New rows in existed DataFrame can be added using append function. This function
will append the rows at the end.
Example –
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])


df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = df.append(df2)
print df

Output –

c) Row Deletion – We can use index label to delete or drop rows from a DataFrame. If label is duplicated,
then multiple rows will be dropped.
Observe, in the above example, the labels 0 and 1 are duplicate.
Example –
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])


df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = df.append(df2)

# Drop rows with label 0


df = df.drop(0)

print df

Output –

Observe, two rows were dropped because those two contain the same label 0.
Renaming DataFrame Rows/Columns index labels – Pandas rename() method is used to rename any
index, column or row.
Syntax - DataFrame.rename(mapper=None, index=None, columns=None,
axis=None, inplace=False)
Parameters:
mapper, index and columns: Dictionary value, key refers to the old name and value refers to
new name. Only one of these parameters can be used at once.
axis: int or string value, 0/’row’ for Rows and 1/’columns’ for Columns. Default is 0.
inplace: Makes changes in original Data Frame if True. Default is false.

-----CSV_FILE_DATA USED FOR EXAMPLES BELOW-------


ADM_NO, STUDENT_NAME, AGE, CLASS, SECTION
101024, RAHUL SINGH, 15, 10, C
101025, SHITAL KUMARI, 16, 10, B
101026, UJJWAL SHARMA, 15, 10, A
-----------------------------------------------------------------------
DataFrame made using above CSV file is:

Example 1 – Changing multiple row labels/names


# importing pandas module
import pandas as pd
# making data frame from csv file
data = pd.read_csv("DATA.CSV", index_col ="ADM_NO" )
# changing rows indexes/names/labels with rename()
data.rename(index = {101024: "STUDENT1", 101025:"STUDENT2"},
inplace = True)
print(data)
Example 2 - Changing multiple column labels/names
# importing pandas module
import pandas as pd
# making data frame from csv file
data = pd.read_csv("DATA.CSV", index_col ="ADM_NO" )
# changing columns indexes/names/labels with rename()
data.rename(columns = {'AGE':'YEARS', 'ADM_NO':'ID'},
inplace = True)
print(data)
Output –

g) Boolean Indexing – In Boolean indexing, we will select subsets of data based on the actual values of
the data in the DataFrame and not on their row/column labels or integer locations. In Boolean indexing,
we use a Boolean vector to filter the data.
Boolean indexing is a type of indexing which uses actual values of the data in the DataFrame. In Boolean
indexing, we can filter a data in four ways –
 Accessing a DataFrame with a Boolean index - In order to access a dataframe with a Boolean
index, we have to create a dataframe in which index of dataframe contains a Boolean value that is
“True” or “False”.
For example –
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = { 'name' : ["aparna", "pankaj", "sudhir", "Geeku"],
'degree' : ["MBA", "BCA", "M.Tech", "MBA"],
'score' : [90, 40, 80, 98]}
df = pd.DataFrame(dict, index = [True, False, True, False])
print(df)
Output –
Now we have created a dataframe with Boolean index after that user can access a dataframe with
the help of Boolean index.
Accessing a Dataframe with a Boolean index using .loc[ ]
In order to access a dataframe with a Boolean index using .loc[ ], we simply pass a Boolean value
(True or False) in a .loc[ ] function.
Example – print(df.loc[True])
Output -

Accessing a Dataframe with a Boolean index using .iloc[ ]


In order to access a dataframe using .iloc[ ], we have to pass a Boolean value (True or False) in a
iloc[ ] function but iloc[ ] function accept only integer as argument so it will throw an error so we can
only access a dataframe when we pass a integer in iloc[ ] function
Example – print(df.iloc[2])
Output –

 Applying a Boolean mask to a dataframe - We can apply a Boolean mask by giving list of True and
False of the same length as contain in a dataframe. When we apply a Boolean mask it will print only
that dataframe in which we pass a Boolean value True.
Example –
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = { 'name' : ["aparna", "pankaj", "sudhir", "Geeku"],
'degree' : ["MBA", "BCA", "M.Tech", "MBA"],
'score' : [90, 40, 80, 98]}
df = pd.DataFrame(dict)
print(df[[True, False, True, False]])
Output –

 Masking data based on column value - In a dataframe we can filter a data based on a column value
in order to filter data, we can apply certain condition on dataframe using different operator like ==,
>, <, <=, >=. When we apply these operator on dataframe then it produce a Series of True and False.
Example 1 – print(df[df['score']>90])
Output –

Example 2 –
masking_condition =(df['score']>80) & (df['degree']=='MBA')
print(df[masking_condition])
Output –

 Masking data based on index value - In a dataframe we can filter a data based on a column value
in order to filter data, we can create a mask based on the index values using different operator like
==, >, <, etc.
Example – print (df[df.index>1])
Output –

You might also like