
Unit I: Data Handling using Pandas and Data Visualization
Marks: 30
Data Handling using Pandas -I
Introduction to Python libraries- Pandas, Matplotlib. Data structures in Pandas -
Series and Data Frames. Series: Creation of Series from – ndarray, dictionary,
scalar value; mathematical operations; Head and Tail functions; Selection,
Indexing and Slicing.
Data Frames: creation - from dictionary of Series, list of dictionaries, Text/CSV
files; display; iteration; Operations on rows and columns: add, select, delete,
rename; Head and Tail functions; Indexing using Labels, Boolean Indexing; Joining,
Merging and Concatenation.
Importing/Exporting Data between CSV files and Data Frames.
Data handling using Pandas – II
Descriptive Statistics: max, min, count, sum, mean, median, mode,
quartile, Standard deviation, variance.
DataFrame operations: Aggregation, group by, Sorting, Deleting and
Renaming Index, Pivoting. Handling missing values – dropping and filling.
Importing/Exporting Data between MySQL database and Pandas.
Introduction to Python Library : Pandas ( Python for Data Analysis)
Introduction to Python Libraries
• A Python library is a collection of ready-made modules that lets us perform
many tasks without writing detailed programs for them.

• We have to import a library before calling its functions.

• NumPy, Pandas and Matplotlib are three well-established Python libraries. These
libraries allow us to manipulate, transform and visualize data easily and efficiently.

• NumPy (Numerical Python) provides a multidimensional array object and
functions for working with these arrays. It is used for numerical analysis and
scientific computing.

• Pandas (PANel DAta) is a high-level data manipulation tool used for
analyzing data. It makes importing and exporting data very easy. It is built
on packages like NumPy and Matplotlib to do data analysis and visualization
work. It provides the data structures Series, DataFrame and Panel to make the
process of analyzing data organized, effective and efficient.

• Matplotlib is used for plotting 2D graphs and for data visualization. It is
built on NumPy and is designed to work well with NumPy and Pandas.
Difference between Pandas and NumPy
• A NumPy array requires homogeneous data, while a Pandas DataFrame can
hold a different data type in each column.
• Pandas has a simple interface for operations like file loading, plotting,
selection, joining and GROUP BY.
• A Pandas DataFrame has column names, which makes it easy to keep track of data.
• Pandas is used when data is in tabular format, whereas NumPy is used for
numeric, array-based data manipulation.
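The first point above can be seen in a small sketch. The names arr and df and the sample values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single data type. Mixing numbers and text
# forces every element to be stored as a string.
arr = np.array([1, 2.5, "three"])
print(arr.dtype)      # a Unicode string dtype

# A DataFrame keeps a separate data type for each column.
df = pd.DataFrame({
    "name": ["Abey", "Bhasu", "Charlie"],
    "age": [16, 17, 16],
    "marks": [81.5, 92.0, 77.25],
})
print(df.dtypes)      # one dtype per column
```

The array ends up with one string type for all three elements, while the DataFrame reports an integer column, a float column and a text column side by side.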
Installing Pandas:
• Open the command prompt.
• Type cd\ to move to the root directory.
• Type pip install pandas
• pip is the Python package installer.
Note: Installing Pandas also installs NumPy (Numerical Python) automatically.
Pandas does not handle arrays on its own; it relies on NumPy, the library that
handles arrays.
Testing Pandas
* Type import pandas as pd in the IDLE shell
>>> import pandas as pd
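If the import raises no error, the installation worked. As a slightly fuller check, the sketch below prints the installed versions; the numbers shown will depend on your machine.

```python
import pandas as pd
import numpy as np

# Both imports succeed only if the libraries are installed.
print("pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
```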
DATA STRUCTURES IN PANDAS
• A data structure is a way of storing and organizing data in a computer.
• Pandas provides three types of data structures:
1) Series: a one-dimensional structure storing homogeneous (same data type),
mutable (modifiable) data such as integers or strings.
2) DataFrame: a two-dimensional structure storing heterogeneous (multiple data
types), mutable data.
3) Panel: a three-dimensional structure for storing data (not in syllabus; it has
been removed from recent versions of Pandas).
1. Series
• A Series is a one-dimensional, array-like structure with homogeneous
(same type of) data.
• The data label associated with a particular value is called its index.

For example, the following series is a collection of integers.

49 55 10 79 67

Basic features of a Series are:
• Homogeneous data
• Size immutable (we cannot change the size of a Series in place)
• Data mutable (the values can be changed)

A Series is a one-dimensional labelled structure capable of holding any data
type (integers, strings, floating point numbers, Python objects, etc.).

A Series can also be described as an ordered dictionary that maps
index values to data values.
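The dictionary-like behaviour can be sketched as follows. The series name days and its values are illustrative only.

```python
import pandas as pd

days = pd.Series({"Jan": 31, "Feb": 28, "Mar": 31})

print(days["Feb"])         # look up a value by its index label, like a dictionary key
print("Mar" in days)       # membership test checks the index labels
print(list(days.index))    # labels keep the order in which they were given

days["Feb"] = 29           # data is mutable: an existing value can be changed
print(days.to_dict())
```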

Examples of Series-type objects

Index  Data      Index    Data      Index        Data
0      22        'Jan'    31        'Sunday'     1
1      -14       'Feb'    28        'Monday'     2
2      52        'Mar'    31        'Tuesday'    3
3      100       'April'  30        'Wednesday'  4
How to create a series in pandas:
• Use the Series() method.
• List or dictionary data can be converted into a series using
this method.
2. DataFrame
A DataFrame is like a two-dimensional array with heterogeneous data.

Basic features of a DataFrame are:
• Heterogeneous data
• Size mutable
• Data mutable
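These three features can be seen in a short sketch. The column names and values below are made up for illustration.

```python
import pandas as pd

# Heterogeneous data: a text column and a float column in one table.
df = pd.DataFrame({
    "name": ["Abey", "Bhasu", "Charlie"],
    "marks": [81.5, 92.0, 77.25],
})

# Size mutable: a new column can be added to the existing DataFrame.
df["passed"] = df["marks"] >= 80

# Data mutable: an individual value can be changed.
df.loc[0, "marks"] = 85.0

print(df)
```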
Create a series with the names of 3 of your friends.
>>> import pandas as pd
>>> data=['Abey','Bhasu','Charlie']
>>> s1=pd.Series(data)
>>> print(s1)
Output:
0 Abey
1 Bhasu
2 Charlie
dtype: object
Create a series with the names of 3 of your friends, with
index values.
>>> import pandas as pd
>>> data=['Abey','Bhasu','Charlie']
>>> s1=pd.Series(data, index=[3,5,1])
>>> print(s1)
Output:
3 Abey
5 Bhasu
1 Charlie
dtype: object
Home work
• Create a series with the first four months as index and the number of days
in each month as data.
• Create a series having names of any five famous monuments of India
and assign their states as index values.
Creating an empty series using Series() Method:
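• An empty series is created by calling Series() with no arguments.

# Example 1: Empty Series using Series() method
>>> import pandas as pd
>>> s1 = pd.Series()
>>> print(s1)

Output:
Series([], dtype: float64)
Note: In pandas 2.0 and later, the empty series is shown with dtype: object
instead of float64.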
Creating a series using Series() method with arguments
A series is created with the Series() method by passing the data elements and
the index as arguments.
Syntax:
<Series object> = pandas.Series(data, index=idx)
* The series output has 2 columns: the index on the left and the data values on
the right. If we don't specify an index, the default index 0 to N-1 is used.
Create a Series using List:
# Example 2: creating a series using Series() with a list as an argument
>>> import pandas as pd
>>> s1 = pd.Series([10, 20, 30, 40])
>>> s1
( or )
>>> print(s1)
Output:
0 10
1 20
2 30
3 40
dtype: int64
Creating a series using the range() function
>>> import pandas as pd
>>> s1 = pd.Series(range(5))
>>> print(s1)
Output:
0 0
1 1
2 2
3 3
4 4
dtype: int64
Creating a series with explicit index values:
>>> import pandas as pd
>>> s1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1)
Output:
a 10
b 20
c 30
d 40
e 50
dtype: int64
Creating a Series from ndarray
Without index argument
>>> import pandas as pd
>>> import numpy as np
>>> data = np.array(['a', 'b', 'c', 'd'])
>>> s1 = pd.Series(data)
>>> print(s1)
Output:
0 a
1 b
2 c
3 d
dtype: object
Creating a Series from ndarray
With index argument
>>> import pandas as pd
>>> import numpy as np
>>> data = np.array(['a', 'b', 'c', 'd'])
>>> s1 = pd.Series(data, index=[100, 101, 102, 103])
>>> print(s1)
Output:
100 a
101 b
102 c
103 d
dtype: object
Create a Series from dict
E.g. 1 (without index)
>>> import pandas as pd
>>> data = {'a': 0, 'b': 1, 'c': 2}
>>> s1 = pd.Series(data)
>>> print(s1)
Output:
a 0
b 1
c 2
dtype: int64
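E.g. 2 (with index)
>>> import pandas as pd
>>> data = {'a': 0, 'b': 1, 'c': 2}
>>> s1 = pd.Series(data, index=['b', 'c', 'd', 'a'])
>>> print(s1)
Output:
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Note: NaN means Not a Number. The label 'd' is not a key in the dictionary, so
its value is missing. Because NaN is a floating point value, the whole series is
converted to float64.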
Create a Series from a scalar value
>>> import pandas as pd
>>> s1 = pd.Series(5, index=[1, 2, 3, 4])
>>> print(s1)
Output:
1 5
2 5
3 5
4 5
dtype: int64
Note: here the value 5 is repeated 4 times, once for each index label.
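Creating a series using the arange() function of NumPy
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.arange(10, 16, 1), index=['a', 'b', 'c', 'd', 'e', 'f'])
>>> print(s1)
Output:
a 10
b 11
c 12
d 13
e 14
f 15
dtype: int32
Note: the integer dtype produced by arange() depends on the platform and the
NumPy version, so the last line may read int64 instead of int32.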
Accessing elements of a series
* There are 2 methods: indexing and slicing.
A) Indexing
There are two types of index: positional index and labelled index. The positional
index is the default index starting from 0, whereas a labelled index is a
user-defined index.
Example 1:
>>> import pandas as pd
>>> s1 = pd.Series([10, 20, 30, 40, 50])
>>> print(s1[2])
30
Example 2:
>>> import pandas as pd
>>> s1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1['d'])
40
>>> s1[['a', 'c', 'e']]
(or)
>>> print(s1[['a', 'c', 'e']])
Output:
a 10
c 30
e 50
dtype: int64
Example 3:
>>> import pandas as pd
>>> sercap = pd.Series(['NewDelhi', 'London', 'Paris'],
index=['India', 'UK', 'France'])
>>> sercap['India']
(or)
>>> print(sercap['India'])
NewDelhi
>>> print(sercap[['UK', 'France']])
UK London
France Paris
dtype: object
How to assign new index values to a series
>>> sercap.index = [10, 20, 30]
>>> print(sercap)
Output:
10 NewDelhi
20 London
30 Paris
dtype: object
B) Slicing
• Slicing a series is similar to slicing a NumPy array.
• Slicing is done by specifying the starting and ending parameters.
• With a positional index, the value at the end position is excluded.
• With a labelled index, the value at the end label is included.
Example using positional index
>>> import pandas as pd
>>> sercap = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
index=['India', 'USA', 'UK', 'France'])
>>> print(sercap[1:3])
Output:
USA WashingtonDC
UK London
dtype: object
Example using labelled index
>>> import pandas as pd
>>> sercap = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
index=['India', 'USA', 'UK', 'France'])
>>> print(sercap['USA':'France'])
Output:
USA WashingtonDC
UK London
France Paris
dtype: object
Slicing a series in reverse order
>>> import pandas as pd
>>> sercap = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
index=['India', 'USA', 'UK', 'France'])
>>> print(sercap[::-1])
Output:
France Paris
UK London
USA WashingtonDC
India NewDelhi
dtype: object
How to modify the values of a series using slicing
Example 1: using positions
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.arange(10, 16, 1), index=['a', 'b', 'c', 'd', 'e', 'f'])
>>> s1[1:3] = 50
>>> print(s1)
Output:
a 10
b 50
c 50
d 13
e 14
f 15
dtype: int32
Example 2: using index labels
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.arange(10, 16, 1), index=['a', 'b', 'c', 'd', 'e', 'f'])
>>> s1['c':'e'] = 500
>>> print(s1)
Output:
a 10
b 11
c 500
d 500
e 500
f 15
dtype: int32
Accessing data from a Series with indexing and slicing (using position)
>>> import pandas as pd
>>> s1 = pd.Series([11, 12, 13, 14, 15], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1[0])     # same element as print(s1['a'])
11
>>> print(s1[:3])
a 11
b 12
c 13
dtype: int64
>>> print(s1[-3:])
c 13
d 14
e 15
dtype: int64
In the first statement, the element at position 0 is displayed. Recent versions
of pandas warn when an integer position is used with [] on a labelled index;
s1.iloc[0] is the explicit way to ask for position 0.

In the second statement, the first 3 elements of the series are displayed.

In the third statement, the last 3 elements are displayed because of negative indexing.
Retrieve data by selection:

There are three methods for data selection:

• loc is used for indexing or selecting based on name, i.e., by row name and
column name. It refers to name-based (label-based) indexing. In a loc slice the
end label is included.

<object>.loc[<list of row names>, <list of column names>]

• iloc is used for indexing or selecting based on position, i.e., by row number
and column number. It refers to position-based indexing. In an iloc slice the
end position is excluded.

<object>.iloc[<row number range>, <column number range>]

A Series has only one axis, so only the row part is given, for example
s.loc['b':'e'] or s.iloc[1:4].

• ix usually tries to behave like loc but falls back to behaving like iloc if a label
is not present in the index. ix is deprecated and has been removed from pandas 1.0
onwards; use loc and iloc instead.

>>> # usage of loc and iloc for accessing elements of a series
>>> import pandas as pd
>>> s = pd.Series([11, 12, 13, 14, 15], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s.loc['b':'e'])
b 12
c 13
d 14
e 15
dtype: int64
>>> print(s.iloc[1:4])
b 12
c 13
d 14
dtype: int64
Pandas Series: retrieve data by selection
e.g. 1
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
>>> print(s1.iloc[:3])      # slices the first three rows by position
>>> print(s1.loc[49:47])    # selects the same rows by index label
Output (the same for both statements):
49 NaN
48 NaN
47 NaN
dtype: float64
e.g. 2
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
>>> print(s1.loc[49:1])     # selects the data according to the index labels
>>> print(s1.iloc[:6])      # selects the same rows by position
Output (the same for both statements):
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
dtype: float64
Conditional Filtering Entries:
>>> import pandas as pd
>>> s1 = pd.Series([1.00000, 1.414214, 1.730751, 2.000000])
>>> print(s1)
Output:
0 1.000000
1 1.414214
2 1.730751
3 2.000000
dtype: float64
>>> print(s1 < 2)
Output:
0 True
1 True
2 True
3 False
dtype: bool
>>> print(s1 >= 2)
Output:
0 False
1 False
2 False
3 True
dtype: bool
>>> print(s1[s1 >= 2])
Output:
3 2.0
dtype: float64
>>> print(s1[s1 < 2])
Output:
0 1.000000
1 1.414214
2 1.730751
dtype: float64
Note:
• The statement s1 < 2 performs a vectorized operation that checks every element
in the series and returns a series of True/False values.
• The statement s1[s1 >= 2] performs a filtering operation and returns only the
elements for which the expression is True.
Conditional Filtering Entries
Filtering entries from a series object can be done using expressions that are of
Boolean type.
<Series object> [ <Boolean expression on series object>]

Example:
Series object s11 stores the charity contribution made by each section

A 6700
B 5600
C 5000
D 5200
Write a program to display which section contributed more than Rs. 5500
Output:
Contribution >5500 are:
A 6700
B 5600
dtype: int64
Program:
>>> import pandas as pd
>>> s11= pd.Series([6700,5600,5000,5200],index=['A','B','C','D'])
>>> print("Contribution >5500 are:")
>>> print(s11[s11>5500])

Output:
Contribution >5500 are:
A 6700
B 5600
dtype: int64
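Boolean expressions can also be combined. The sketch below reuses the contribution data from the program above; the names between and extremes and the chosen limits are illustrative only. Each condition is wrapped in parentheses, and & (and) or | (or) is used instead of the Python keywords and / or.

```python
import pandas as pd

s11 = pd.Series([6700, 5600, 5000, 5200], index=['A', 'B', 'C', 'D'])

# Sections that contributed between 5200 and 6000 (both limits included).
between = s11[(s11 >= 5200) & (s11 <= 6000)]
print(between)

# Sections that contributed less than 5100 or more than 6500.
extremes = s11[(s11 < 5100) | (s11 > 6500)]
print(extremes)
```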
Sorting Series values:
A Series object can be sorted based on its values or its index.

• Sorting on the basis of values:
<Series object>.sort_values([ascending=True | False])
If s1 is
A 6700
B 5600
C 5000
D 5200
dtype: int64

>>> print(s1.sort_values())
Output:
C 5000
D 5200
B 5600
A 6700
dtype: int64
>>> print(s1.sort_values(ascending=False))
Output:
A 6700
B 5600
D 5200
C 5000
dtype: int64
• Sorting on the basis of index:
<Series object>.sort_index([ascending=True | False])
If s1 is
A 6700
B 5600
C 5000
D 5200
dtype: int64

>>> s1.sort_index()
Output:
A 6700
B 5600
C 5000
D 5200
dtype: int64
>>> s1.sort_index(ascending=False)
Output:
D 5200
C 5000
B 5600
A 6700
dtype: int64
Deleting elements from a Series
• An element of a series can be deleted using the drop() method, passing its index
label as the argument.
Example
>>> import pandas as pd
>>> s1 = pd.Series([1.00000, 1.414214, 1.730751, 2.000000], index=range(1, 5))
>>> print(s1)
Output
1 1.000000
2 1.414214
3 1.730751
4 2.000000
dtype: float64
To remove one element from the series
>>> print(s1.drop(3))           # drops the element temporarily; s1 itself is unchanged
>>> s1.drop(3, inplace=True)    # drops the element from s1 permanently
>>> print(s1)
Output:
1 1.000000
2 1.414214
4 2.000000
dtype: float64
To remove more than one element from the series
Pass a list of index labels. The output below assumes the original s1 with labels
1 to 4; once label 3 has been dropped permanently, dropping it again raises a KeyError.
>>> print(s1.drop([1, 3]))
Output:
2 1.414214
4 2.000000
dtype: float64
Methods of Series
• head()
• tail()
• count()

• Series.head(n) is a series function that fetches the first n elements of a series.
• By default it gives the first 5 elements of the series.
• Series.tail(n) is a series function that fetches the last n elements; by default
it displays the last 5 elements.

Example 1:
>>> import pandas as pd
>>> s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1.head(3))
Output:
a 1
b 2
c 3
dtype: int64
Example 2:
>>> import pandas as pd
>>> s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1.head())
Output:
a 1
b 2
c 3
d 4
e 5
dtype: int64
Pandas tail() function:
Example 1:
>>> import pandas as pd
>>> s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1.tail(2))
Output:
d 4
e 5
dtype: int64
Example 2:
>>> import pandas as pd
>>> s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1.tail())
Output:
a 1
b 2
c 3
d 4
e 5
dtype: int64
Pandas count() function:
• Returns the number of non-NaN values in the series.

Example 1:
>>> import pandas as pd
>>> s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1.count())
Output:
5
Example 2:
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series([1, 2, np.nan, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1.count())
Output:
4
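As a small sketch of how count() differs from the total number of elements, the example below compares it with size and len() on a series that contains one NaN. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, 5])

non_missing = s.count()       # counts only the non-NaN values
total = s.size                # counts every element, NaN included
missing = s.isna().sum()      # counts only the NaN values

print(non_missing, total, len(s), missing)
```

Here count() plus the number of missing values equals size.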
Homework
Consider the following code:
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series([12, np.nan, 10])
>>> print(s1)
i) Find the output.
ii) Write a Python statement to count and display only the non-null
values in the above series.
Output
i)
0 12.0
1 NaN
2 10.0
dtype: float64
ii) >>> s1.count()
2
Series Object Attributes:
The properties of a series can be accessed through its associated attributes.
1) Series.index → returns the index of the series
2) Series.values → returns the data as an ndarray
3) Series.dtype → returns the dtype object of the underlying data
4) Series.shape → returns a tuple of the shape of the underlying data
5) Series.nbytes → returns the number of bytes in the underlying data
6) Series.ndim → returns the number of dimensions
7) Series.size → returns the number of elements
8) Series.hasnans → returns True if there is any NaN
9) Series.empty → returns True if the series object is empty
Naming the Series and the index column
>>> import pandas as pd
>>> s1 = pd.Series({'Jan': 31, "Feb": 28, "Mar": 31, "Apr": 30})
>>> s1.name="Days"
>>> s1.index.name="Months"
>>> print(s1)
Output:
Months
Jan 31
Feb 28
Mar 31
Apr 30
Name: Days, dtype: int64
>>> import pandas as pd
>>> s1 = pd.Series( range(1, 15, 3), index= [x for x in 'abcde'])
>>> s1.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> s1.values
array([ 1, 4, 7, 10, 13], dtype=int64)
>>> s1.dtype
dtype('int64')
>>> s1.shape
(5,)
>>> s1.nbytes
40
>>> s1.ndim
1
>>> s1.size
5
>>> s1.hasnans
False
>>> s1.empty
False
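Reference: Sumitha Arora, Class 11, page 297
• int8 → 1 byte
• int16 → 2 bytes
• int32 → 4 bytes
• int64 → 8 bytes
In the example above, s1.nbytes is 40 because the series holds 5 elements of
type int64 (5 × 8 bytes).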
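The relation between the integer type and nbytes can be checked with a short sketch. The dictionary name sizes is illustrative; the same five values are stored with each integer type in turn.

```python
import pandas as pd

sizes = {}
for dtype in ("int8", "int16", "int32", "int64"):
    s = pd.Series([1, 4, 7, 10, 13], dtype=dtype)
    # nbytes = number of elements x bytes per element
    sizes[dtype] = s.nbytes
    print(dtype, s.dtype.itemsize, s.nbytes)
```

Choosing a smaller integer type reduces memory use, but the values must fit in that type's range.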
Mathematical operations with Series
e.g. 1:
>>> import pandas as pd
>>> s1 = pd.Series([1, 2, 3])
>>> s2 = pd.Series([1, 2, 4])
>>> s3 = s1 + s2
>>> print(s3)
Output:
0 2
1 4
2 7
dtype: int64
e.g. 2:
>>> import pandas as pd
>>> s1 = pd.Series([1, 2, 3])
>>> s2 = pd.Series([1, 2, 4])
>>> s3 = s1 * s2
>>> print(s3)
Output:
0 1
1 4
2 12
dtype: int64
e.g. 3
>>> import pandas as pd
>>> import numpy as np
>>> s1 = np.arange(10, 15)
>>> s2 = pd.Series(index=s1, data=s1 * 4)
>>> print(s2)
Output:
10 40
11 44
12 48
13 52
14 56
dtype: int32
e.g. 4
>>> import pandas as pd
>>> import numpy as np
>>> s1 = np.arange(10, 15)
>>> s2 = pd.Series(index=s1, data=s1 ** 4)
>>> print(s2)
Output:
10 10000
11 14641
12 20736
13 28561
14 38416
dtype: int32
e.g. 5 (concatenate your first name with your last name)
>>> import pandas as pd
>>> data = ['I', 'n', 'f', 'o', 'r']
>>> s1 = pd.Series(data + ['m', 'a', 't', 'i', 'c', 's'])
>>> s1
Output:
0 I
1 n
2 f
3 o
4 r
5 m
6 a
7 t
8 i
9 c
10 s
dtype: object
e.g. 6
>>> import pandas as pd
>>> s1 = ['a', 'b', 'c']
>>> s2 = pd.Series(data=s1 * 2)
>>> print(s2)
Output:
0 a
1 b
2 c
3 a
4 b
5 c
dtype: object
In e.g. 5 and e.g. 6 the operators + and * act on Python lists (joining and
repeating them) before the Series is created.
Note:
• Arithmetic operations between two series are performed on elements that have the
same index label; a label present in only one of the series gives NaN in the result.
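The note above can be sketched with two small series whose labels only partly overlap. The names and values are illustrative; add() with fill_value is the method form of + that substitutes a value for a missing label.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels 'b' and 'c' exist in both series and are added;
# 'a' and 'd' exist in only one series and give NaN.
total = s1 + s2
print(total)

# Treat a missing label as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
print(filled)
```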
Homework:
• Differentiate between NumPy arrays and Series objects.
• Draw tables of subtraction showing the changes in the series elements and the
corresponding output, without replacing missing values and after replacing
the missing values with 1000.
Homework
>>> import pandas as pd
>>> seriesA = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
>>> seriesB = pd.Series([10, 20, -10, -50, 100], index=['z', 'x', 'a', 'c', 'e'])
>>> print(seriesA - seriesB)
Output
a 11.0
b NaN
c 53.0
d NaN
e -95.0
x NaN
z NaN
dtype: float64
Subtraction after replacing the missing (NaN) values with 1000
>>> print(seriesA.sub(seriesB, fill_value=1000))
Output
a 11.0
b -998.0
c 53.0
d -996.0
e -95.0
x 980.0
z 990.0
dtype: float64
