Unit I: Data Handling Using Pandas and Data Visualization: Marks:30

Data Handling using Pandas -I
Introduction to Python libraries- Pandas, Matplotlib. Data structures in Pandas -
Series and Data Frames. Series: Creation of Series from – ndarray, dictionary,
scalar value; mathematical operations; Head and Tail functions; Selection,
Indexing and Slicing.
Data Frames: creation - from dictionary of Series, list of dictionaries, Text/CSV
files; display; iteration; Operations on rows and columns: add, select, delete,
rename; Head and Tail functions; Indexing using Labels, Boolean Indexing; Joining,
Merging and Concatenation.
Importing/Exporting Data between CSV files and Data Frames.
Data handling using Pandas – II
Descriptive Statistics: max, min, count, sum, mean, median, mode,
quartile, Standard deviation, variance.
DataFrame operations: Aggregation, group by, Sorting, Deleting and
Renaming Index, Pivoting. Handling missing values – dropping and filling.
Importing/Exporting Data between MySQL database and Pandas.
Introduction to Python Library : Pandas ( Python for Data Analysis)
Introduction to Python Libraries
• Python libraries contain collections of built-in modules that allow us to perform
many actions without writing detailed programs for them.
• We have to import a library before calling its functions.
• NumPy, Pandas and Matplotlib are three well-established Python libraries. These
libraries allow us to manipulate, transform and visualize data easily and efficiently.
• NumPy (Numerical Python) uses a multidimensional array object and has
functions for working with these arrays. It is used for numerical analysis and
scientific computing.
• Pandas (PANel DAta) is a high-level data manipulation tool used for
analyzing data. It is very easy to import and export data using the Pandas
library. It is built on NumPy and uses Matplotlib for plotting, so it supports
both data analysis and visualization work. Its data structures (Series, DataFrame
and Panel) make the process of analyzing data organized, effective and efficient.
• Matplotlib is used for plotting 2D graphs and for visualization. It is
built on NumPy and is designed to work well with NumPy and Pandas.
Difference between Pandas and NumPy
• A NumPy array requires homogeneous data, while a Pandas DataFrame can hold
a different data type in each column.
• Pandas has a simple interface for operations like file loading, plotting,
selection, joining and GROUP BY operations.
• A Pandas DataFrame with column names makes it easy to keep track of data.
• Pandas is used when data is in tabular format, whereas NumPy is used for
numeric array-based data manipulation.
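The first difference can be seen in a minimal sketch like the one below. It assumes NumPy and Pandas are installed; the names arr and df and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single data type: mixing numbers and text
# makes NumPy convert every element to a string.
arr = np.array([1, 2.5, "three"])
print(arr.dtype.kind)   # 'U' stands for a Unicode string type

# A DataFrame keeps a separate data type for each column.
df = pd.DataFrame({"name": ["Abey", "Bhasu"], "marks": [81, 92]})
print(df.dtypes)
```

Here the array is forced into one type, while the DataFrame keeps text in one column and integers in the other.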
Installing Pandas:
• Open the command prompt
• Type cd\ to move to the root directory
• Type pip install pandas
• pip is the Python package installer
Note: When Pandas is installed, NumPy (Numerical Python) is also installed
automatically, because Pandas depends on it. Pandas cannot handle arrays on its own;
NumPy is the library which handles arrays.
Testing Pandas
* Type import pandas as pd in the IDLE shell
>>> import pandas as pd
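As a small additional check (a sketch, not part of the original steps), the installed version can be printed once the import succeeds:

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version string
# shows which release is in use.
print(pd.__version__)
```

Knowing the version is useful because a few behaviours shown in these notes differ between older and newer releases.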
DATA STRUCTURES IN PANDAS
• A data structure is a way of storing and organizing data in a computer.
• Pandas has three types of data structures, namely:
1) Series: a one-dimensional structure storing homogeneous (same data type),
mutable (modifiable) data such as integers or strings.
2) DataFrame: a two-dimensional structure storing heterogeneous (multiple data
types), mutable data.
3) Panel: a three-dimensional way of storing data (not in syllabus; Panel was removed from Pandas in version 0.25).
1. Series
• A Series is a one-dimensional array-like structure with homogeneous
(same type of) data.
• The data label associated with a particular value is called its index.
For example, the following series is a collection of integers.

49 55 10 79 67
Basic features of a series are:
• Homogeneous data
• Size immutable (we cannot change the size of series data)
• Data mutable (the values can be modified)
A Series is a one-dimensional labelled structure capable of holding data of any
type (integers, strings, floating point numbers, Python objects, etc.).
A series can also be described as an ordered dictionary with a mapping of
index values to data values.
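The dictionary-like behaviour described above can be sketched as follows. The series name days and its values are illustrative only.

```python
import pandas as pd

# Illustrative data: number of days in the first three months.
days = pd.Series({"Jan": 31, "Feb": 28, "Mar": 31})

print(days["Feb"])        # look up a value by its index label, like a dictionary key
print("Mar" in days)      # membership tests check the index labels, not the values
print(list(days.index))   # labels keep the order in which they were given
print(days.to_dict())     # convert back to an ordinary dictionary
```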
Example of series-type objects
Index   Data      Index     Data      Index         Data
0       22        'Jan'     31        'Sunday'      1
1       -14       'Feb'     28        'Monday'      2
2       52        'Mar'     31        'Tuesday'     3
3       100       'April'   30        'Wednesday'   4
How to create series in pandas:
• Using Series() method
• List or dictionary data can be converted into series using
this method.
2. DataFrame
A DataFrame is like a two-dimensional array with heterogeneous data.
Basic features of a DataFrame are:
• Heterogeneous data
• Size mutable
• Data mutable
Create a series with the names of 3 of your friends.
>>> data=['Abey','Bhasu','Charlie']
>>> s1=pd.Series(data)
>>> print(s1)
Output:
0 Abey
1 Bhasu
2 Charlie
dtype: object
Create a series with the names of 3 of your friends, with
index values.
>>> data=['Abey','Bhasu','Charlie']
>>> s1=pd.Series(data, index=[3,5,1])
>>> print(s1)
Output:
3 Abey
5 Bhasu
1 Charlie
dtype: object
Home work
• Create a series with first four months as index and no of days in it as
data.
• Create a series having names of any five famous monuments of India
and assign their states as index values.
Creating an empty series using Series() Method:
• It is created by calling the Series() method with no arguments.
# Example 1: Empty Series using Series() Method

>>> s1 = pd.Series()
>>> print(s1)
Output:
Series([], dtype: float64)
Note: Pandas 2.0 and later report dtype: object for an empty series.
Creating a series using Series() method with Arguments
A series is created using the Series() method by passing the data elements and the
index as arguments to it.
Syntax:
<Series object> = pandas.Series(data, index=idx)
* The series output has 2 columns: the index on the left and the data values on the right.
If we don't specify an index, the default index runs from 0 to N-1.
Create a Series using List:
# Example 2: creating a series using Series() with List as an argument
>>> s1 = pd.Series([10,20,30,40])
>>> s1
( or )
>>> print(s1)
Output:
0 10
1 20
2 30
3 40
dtype: int64
Creating a series using the range() function
>>> import pandas as pd
>>> s1 = pd.Series(range(5))
>>> print(s1)
0 0
1 1
2 2
3 3
4 4
dtype: int64
Creating a series with explicit index values:
>>> s1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1)
a 10
b 20
c 30
d 40
e 50
dtype: int64
Creating a Series from ndarray
Without index Argument
>>> import numpy as np
>>> data = np.array(['a', 'b', 'c', 'd'])
>>> s1 = pd.Series(data)
>>> print(s1)
Output:
0 a
1 b
2 c
3 d
dtype: object
Creating a Series from ndarray
With index Argument
>>> data = np.array(['a', 'b', 'c', 'd'])
>>> s1 = pd.Series(data, index=[100,101,102,103])
>>> print(s1)
Output:
100 a
101 b
102 c
103 d
dtype: object
Create a Series from dict
Eg.1(without index)
>>> data = {'a':0,'b':1,'c':2}
>>> s1 = pd.Series ( data)
>>> print(s1)
Output:
a 0
b 1
c 2
dtype: int64
Eg.2 (with index)
>>> data = {'a':0,'b':1,'c':2}
>>> s1 =pd.Series( data, index= ['b' ,'c', 'd' ,'a'])
>>> print(s1)
Output:
b 1.0
c 2.0
d NaN    <-- NaN (Not a Number): 'd' is not a key in the dictionary
a 0.0
dtype: float64
Create a Series from Scalar
>>> s1 =pd.Series(5, index=[1,2,3,4])
>>> print(s1)
Output:
1 5
2 5
3 5
4 5
dtype: int64
Note: here 5 is repeated 4 times (once for each index label).
Creating a series using the arange() function of NumPy
>>> s1=pd.Series(np.arange(10,16,1),index=['a','b','c','d','e','f'])
>>> print(s1)
a 10
b 11
c 12
d 13
e 14
f 15
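dtype: int32
Note: the dtype may appear as int64 instead of int32, depending on the operating system and the NumPy version.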
Accessing elements of a series
* There are 2 methods: indexing and slicing.
A) Indexing
There are two types of index: positional and labelled. A positional index
is the default index starting from 0, whereas a labelled index is a user-defined index.
Example 1:
>>> s1 = pd.Series([10, 20, 30, 40, 50])
>>> print(s1[2])
30
Example 2:
>>> s1 = pd.Series([10, 20, 30, 40, 50], index=['a','b','c','d','e'])
>>> print(s1['d'])
40
>>> s1[['a','c','e']]
(or)
>>> print(s1[['a','c','e']])
Output:
a 10
c 30
e 50
dtype: int64
Example 3:
>>> sercap = pd.Series(['NewDelhi','London','Paris'],
                       index=['India','UK','France'])
>>> sercap['India']
(or)
>>> print(sercap['India'])
NewDelhi
>>> print(sercap[['UK','France']])
UK London
France Paris
dtype: object
How to assign new index values to series
>>> sercap.index = [10,20,30]
>>> print(sercap)
10 NewDelhi
20 London
30 Paris
dtype: object
B) Slicing
• Similar to slicing with NumPy arrays
• Slicing is done by specifying the start and end positions or labels.
• With a positional index, the value at the end position is excluded; with a
labelled index, the value at the end label is included.
Example:
>>> sercap = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
                       index=['India', 'USA', 'UK', 'France'])
>>> print(sercap[1:3])
Output:
USA WashingtonDC
UK London
dtype: object
Example using labelled index
>>> sercap = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
                       index=['India', 'USA', 'UK', 'France'])
>>> print(sercap['USA':'France'])
USA WashingtonDC
UK London
France Paris
dtype: object
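The difference between the two kinds of slice can be checked side by side. This is a short sketch reusing the sercap series from the examples; the variable names by_position and by_label are illustrative.

```python
import pandas as pd

sercap = pd.Series(["NewDelhi", "WashingtonDC", "London", "Paris"],
                   index=["India", "USA", "UK", "France"])

by_position = sercap[1:3]            # positions 1 and 2; position 3 is excluded
by_label = sercap["USA":"France"]    # labels USA to France; France is included

print(len(by_position))   # 2
print(len(by_label))      # 3
```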
Series in reverse order slicing
>>> sercap=pd.Series(['NewDelhi','WashingtonDC','London','Paris'],
index=['India','USA','UK','France'])
>>> print(sercap[::-1])
France Paris
UK London
USA WashingtonDC
India NewDelhi
dtype: object
How to modify the values of series using slicing
>>> s1=pd.Series(np.arange(10,16,1),index=['a','b','c','d','e','f'])
>>> s1[1:3]=50
>>> print(s1)
a 10
b 50
c 50
d 13
e 14
f 15
dtype: int32
Example 2: using index label
>>> s1=pd.Series(np.arange(10,16,1),index=['a','b','c','d','e','f'])
>>> s1['c' :'e']=500
>>> print(s1)
a 10
b 11
c 500
d 500
e 500
f 15
dtype: int32
Accessing Data from Series with indexing and slicing (using position)
e.g.
>>> import pandas as pd
>>> s1 = pd.Series([11, 12, 13, 14, 15], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1[0])     # same element as print(s1['a'])
11
>>> print(s1[:3])
a 11
b 12
c 13
dtype: int64
>>> print(s1[-3:])
c 13
d 14
e 15
dtype: int64
In the first statement the element at position 0 is displayed.
In the second statement the first 3 elements of the series are displayed.
In the third statement the last 3 elements are displayed because of negative indexing.
Note: recent Pandas versions warn that s1[0] on a labelled series is deprecated; s1.iloc[0] selects by position without the warning.
Retrieve Data from selection:
There are three methods for data selection:
• loc is used for indexing or selecting based on name, i.e., by row name and
column name. It refers to name-based indexing.
<object>.loc[<list of row names>, <list of column names>]
• iloc is used for indexing or selecting based on position, i.e., by row number
and column number. It refers to position-based indexing.
<object>.iloc[<row number range>, <column number range>]

• ix usually tries to behave like loc but falls back to behaving like iloc if a label
is not present in the index. ix is deprecated and was removed in Pandas 1.0; use loc
and iloc instead.
>>> # usage of loc and iloc for accessing elements of a series
>>> s = pd.Series([11,12,13,14,15], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s.loc['b':'e'])
b 12
c 13
d 14
e 15
dtype: int64
>>> print(s.iloc[1:4])
b 12
c 13
d 14
dtype: int64
Pandas Series Retrieve Data from selection
e.g.1
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
>>> print(s1.iloc[:3])     # slice the first three rows by position
(or)
>>> print(s1.loc[49:47])   # slice from label 49 to label 47, both included
Output:
49 NaN
48 NaN
47 NaN
dtype: float64
e.g.2
>>> import pandas as pd
>>> s1 = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
>>> print(s1.loc[49:1])    # selects the data according to the index name
(or)
>>> print(s1.iloc[:6])
Output:
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
dtype: float64
Conditional Filtering Entries:
>>> s1 = pd.Series([1.00000,1.414214,1.730751,2.000000])
>>> print(s1)
Output:
0 1.000000
1 1.414214
2 1.730751
3 2.000000
dtype: float64
>>> print(s1 < 2)
Output:
0 True
1 True
2 True
3 False
dtype: bool
>>> print(s1 >= 2)
Output:
0 False
1 False
2 False
3 True
dtype: bool
>>> print(s1[s1 >= 2])
Output:
3 2.0
dtype: float64
>>> print(s1[s1 < 2])
Output:
0 1.000000
1 1.414214
2 1.730751
dtype: float64
Note:
• The statement s1 < 2 performs a vectorized operation which checks every element
in the series.
• The statement s1[s1 >= 2] performs a filtering operation and returns only the
elements for which the expression is True.
Conditional Filtering Entries
Filtering entries from a series object can be done using expressions that are of
Boolean type.
<Series object> [ <Boolean expression on series object>]
Example:
Series object s11 stores the charity contribution made by each section
A 6700
B 5600
C 5000
D 5200
Write a program to display which section contributed more than Rs. 5500
Output:
Contribution >5500 are:
A 6700
B 5600
dtype: int64
Program:
>>> s11= pd.Series([6700,5600,5000,5200],index=['A','B','C','D'])
>>> print("Contribution >5500 are:")
>>> print(s11[s11>5500])
Output:
Contribution >5500 are:
A 6700
B 5600
dtype: int64
Sorting Series values:
A Series object can be sorted based on its values or its index.
• Sorting on the basis of values:

<Series object>.sort_values([ascending=True | False])
If s1 is
A 6700
B 5600
C 5000
D 5200
dtype: int64
>>> print(s1.sort_values())
Output:
C 5000
D 5200
B 5600
A 6700
dtype: int64
>>> print(s1.sort_values(ascending=False))
Output:
A 6700
B 5600
D 5200
C 5000
dtype: int64
• Sorting on the basis of the index:
<Series object>.sort_index([ascending=True | False])
If s1 is
A 6700
B 5600
C 5000
D 5200
dtype: int64
>>> s1.sort_index()
Output:
A 6700
B 5600
C 5000
D 5200
dtype: int64
>>> s1.sort_index(ascending=False)
Output:
D 5200
C 5000
B 5600
A 6700
dtype: int64
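One point worth checking is that sorting returns a new series and leaves the original unchanged. A small sketch with the same contribution data (the name sorted_s1 is illustrative):

```python
import pandas as pd

s1 = pd.Series([6700, 5600, 5000, 5200], index=["A", "B", "C", "D"])

sorted_s1 = s1.sort_values()   # returns a new, sorted Series

print(list(sorted_s1.index))   # ['C', 'D', 'B', 'A']
print(list(s1.index))          # ['A', 'B', 'C', 'D'], s1 itself is unchanged
```

To keep the sorted order, assign the result back to a variable as done here.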
Deleting elements from a Series
• An element of a series can be deleted using the drop() method, passing its index
label as the argument.
Example
>>> s1 = pd.Series([1.00000,1.414214,1.730751,2.000000], index=range(1,5))
>>> print(s1)
Output
1 1.000000
2 1.414214
3 1.730751
4 2.000000
dtype: float64
To remove one element from the series
>>> print(s1.drop(3))          # drops the element temporarily (s1 is unchanged)
>>> s1.drop(3, inplace=True)   # drops the element permanently
>>> print(s1)
Output:
1 1.000000
2 1.414214
4 2.000000
dtype: float64
To remove more than one element from the
series (starting again from the original four-element s1, because label 3 was
dropped permanently above)
>>> print(s1.drop([1,3]))
Output:
2 1.414214
4 2.000000
dtype: float64
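The difference between a temporary and a permanent drop can be verified with a short sketch. It uses the same values as the example; the name smaller is illustrative.

```python
import pandas as pd

s1 = pd.Series([1.0, 1.414214, 1.730751, 2.0], index=range(1, 5))

smaller = s1.drop(3)          # returns a new Series without label 3
print(3 in s1.index)          # True: s1 still has the element
print(3 in smaller.index)     # False

s1.drop(3, inplace=True)      # changes s1 itself
print(3 in s1.index)          # False
```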
Methods of Series
• head()
• tail()
• count()
• Series.head(n) is a series function that fetches the first n elements of a Pandas object.
• By default it gives the first 5 elements of the series.
• Series.tail(n) is a series function that displays the last n elements, five by default.
Example 1:
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
>>> print(s1.head(3))
Output:
a 1
b 2
c 3
dtype: int64
Example 2:
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
>>> print(s1.head())
Output:
a 1
b 2
c 3
d 4
e 5
dtype: int64
Pandas tail() function:
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
>>> print(s1.tail(2))
Output:
d 4
e 5
dtype: int64
>>> print(s1.tail())
Output:
a 1
b 2
c 3
d 4
e 5
dtype: int64
pandas count() function:
• Returns the number of non-NaN values in the series.
Example 1:
>>> s1 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
>>> print(s1.count())
Output:
5
Example 2:
>>> s1 = pd.Series([1,2,np.nan,4,5], index=['a','b','c','d','e'])
>>> print(s1.count())
Output:
4
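To see why count() can differ from the number of elements, the sketch below compares it with size and len() on the second series from the example above.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, np.nan, 4, 5], index=["a", "b", "c", "d", "e"])

print(s1.count())   # 4: NaN is not counted
print(s1.size)      # 5: every element, including NaN
print(len(s1))      # 5: same as size
```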
Homework
Consider the following code:
>>> s1 = pd.Series([12, np.nan, 10])
>>> print(s1)
i) Find the output.
ii) Write a Python statement to count and display only the non-null
values in the above series.
Output
i)
0 12.0
1 NaN
2 10.0
dtype: float64
ii) >>> s1.count()
2
Series Object Attributes:
The properties of a series can be read through its associated attributes.
1) Series.index: returns the index of the series
2) Series.values: returns the data as an ndarray
3) Series.dtype: returns the dtype object of the underlying data
4) Series.shape: returns a tuple of the shape of the underlying data
5) Series.nbytes: returns the number of bytes of the underlying data
6) Series.ndim: returns the number of dimensions
7) Series.size: returns the number of elements
8) Series.hasnans: returns True if there is any NaN
9) Series.empty: returns True if the series object is empty
Naming the Series and the index column
>>> s1 = pd.Series({'Jan':31,"Feb":28,"Mar":31,"Apr":30})
>>> s1.name="Days"
>>> s1.index.name="Months"
>>> print(s1)
Output:
Months
Jan 31
Feb 28
Mar 31
Apr 30
Name: Days, dtype: int64
>>> s1 = pd.Series( range(1, 15, 3), index= [x for x in 'abcde'])
>>> s1.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> s1.values
array([ 1, 4, 7, 10, 13], dtype=int64)
>>> s1.dtype
dtype('int64')
>>> s1.shape
(5,)
>>> s1.nbytes
40
>>> s1.ndim
1
>>> s1.size
5
>>> s1.hasnans
False
>>> s1.empty
False
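Reference: Sumitha Arora, Class 11, page 297
• int8: 1 byte per element
• int16: 2 bytes per element
• int64: 8 bytes per element, so the 5 elements above occupy 5 x 8 = 40 bytes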
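The nbytes value can be reproduced as the number of elements multiplied by the bytes per element. The sketch below assumes the series is stored as 64-bit integers, which is set explicitly here so the result does not depend on the platform.

```python
import pandas as pd

# Same values as the attribute example, stored explicitly as int64.
s1 = pd.Series(range(1, 15, 3), index=list("abcde"), dtype="int64")

bytes_per_element = s1.dtype.itemsize   # 8 bytes for int64
print(bytes_per_element * s1.size)      # 8 * 5 = 40
print(s1.nbytes)                        # 40, the same total
print(s1.astype("int8").nbytes)         # 5: one byte per element
```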
Mathematical operations with Series
e.g. 1:
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3])
>>> s2 = pd.Series([1,2,4])
>>> s3 = s1 + s2
>>> print(s3)
Output:
0 2
1 4
2 7
dtype: int64
e.g. 2:
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3])
>>> s2 = pd.Series([1,2,4])
>>> s3 = s1 * s2
>>> print(s3)
Output:
0 1
1 4
2 12
dtype: int64
e.g. 3
>>> s1 = np.arange(10,15)
>>> s2 = pd.Series(index=s1, data=s1*4)
>>> print(s2)
Output:
10 40
11 44
12 48
13 52
14 56
dtype: int32
e.g. 4
>>> s1 = np.arange(10,15)
>>> s2 = pd.Series(index=s1, data=s1**4)
>>> print(s2)
Output:
10 10000
11 14641
12 20736
13 28561
14 38416
dtype: int32
e.g. 5 (exercise: concatenate your first name with your last name in the same way)
>>> import pandas as pd
>>> data = ['I','n','f','o','r']
>>> s1 = pd.Series(data + ['m','a','t','i','c','s'])   # list + list joins the two lists
>>> s1
Output:
0 I
1 n
2 f
3 o
4 r
5 m
6 a
7 t
8 i
9 c
10 s
dtype: object
e.g. 6
>>> import pandas as pd
>>> s1 = ['a', 'b', 'c']
>>> s2 = pd.Series(data=s1*2)   # s1 is a Python list, so s1*2 repeats the list
>>> print(s2)
Output:
0 a
1 b
2 c
3 a
4 b
5 c
dtype: object
Note:
• Arithmetic operations between two series are performed on elements with matching
index labels; a label present in only one series gives NaN in the result.
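A minimal sketch of this index matching, using two small made-up series whose labels only partly overlap:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = s1 + s2
print(total)
# 'b' and 'c' exist in both series, so they are added: 12.0 and 23.0.
# 'a' and 'd' exist in only one series, so the result there is NaN.
```

The result becomes float64 because NaN is a floating point value.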
Homework:
• Differentiate between NumPy arrays and Series objects.
• Draw tables of subtraction showing the changes in the series elements and the
corresponding output without replacing missing values and after replacing
the missing values with 1000.
Homework
>>> seriesA=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
>>> seriesB=pd.Series([10,20,-10,-50,100],index=['z','x','a','c','e'])
>>> print(seriesA-seriesB)
Output
a 11.0
b NaN
c 53.0
d NaN
e -95.0
x NaN
z NaN
dtype: float64
Subtraction after replacing NaN (missing) values with 1000
>>> print(seriesA.sub(seriesB,fill_value=1000))
a 11.0
b -998.0
c 53.0
d -996.0
e -95.0
x 980.0
z 990.0
dtype: float64
