DATA HANDLING USING PANDAS–I – SERIES


INTRODUCTION: The process of analyzing a large data set in order to answer questions about that data is
known as Data Science or Data Analytics.

Data Analytics is necessary to handle huge volumes of data. Before data can be analyzed it usually has to be
processed, because it is rarely available in a form that is ready for analysis. Data generally arrives in different
formats, such as CSV files, Excel files and HTML files, and all of these have to be converted into a single format.

The analysis of data involves a sequence of steps: converting data of different types into one type,
storing it, performing operations such as join, merge and search, and plotting the data in the form of a graph.
Python provides libraries for each of these steps.

Python Pandas is one such library; it provides a wide range of methods for data analysis.

PANDAS: It is a high-level data manipulation tool developed by Wes McKinney for data analysis and
visualization work. It offers powerful and flexible data structures that make data analysis and manipulation easy.
The term ‘Pandas’ is derived from ‘panel data’, a term used for multidimensional, structured data sets.
Pandas provides easy-to-use data structures and data analysis tools.

Features of Pandas: Pandas is one of the most popular libraries in the scientific Python ecosystem for data
analysis. It can handle several tasks related to data processing and offers the following features

 • It can read and write data in many different file formats, such as CSV, Excel and HTML, and can store
   data of different types, such as integer, float and string
 • Columns of a Pandas data structure can be deleted or inserted
 • It supports the group by operation for data aggregation and transformation, and allows high-performance
   merging and joining of data
 • It offers good I/O capabilities, as it can easily read data from a MySQL database directly into a dataframe
 • It can easily select subsets of data from bulky data sets and can even combine multiple data sets
 • It has the functionality to find and fill missing data
 • It allows operations to be applied to independent groups within the data
 • It supports reshaping of data into different forms
 • It supports advanced time-series functionality, which helps when working with date-indexed data, for
   example data used to predict future values from previously observed values
 • It supports visualization by integrating with libraries such as matplotlib and seaborn

Pandas is best at handling huge tabular data sets comprising different data formats

INSTALLING PANDAS: The procedure for installing Pandas is as follows

Step 1: Open Command Prompt as an Administrator
Step 2: Type cd\ to move to the root directory
Step 3: Make sure the computer is connected to the internet and type the following command
pip install pandas
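
Once the command has finished, the installation can be checked from Python. The lines below are only a minimal sketch of such a check; the variable name installed_version is chosen here for illustration and is not part of Pandas.

```python
import pandas as pd

# If the installation succeeded, the import above works and the
# installed version string can be read from the module.
installed_version = pd.__version__
print("pandas version:", installed_version)
```

If the import fails with ModuleNotFoundError, Pandas was most likely installed for a different Python interpreter than the one being run.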

DATA STRUCTURES IN PANDAS: A data structure is a specialized format for organizing, processing, retrieving
and storing data. Python Pandas provides the data structures Series and DataFrame. Older versions also
provided a third structure, Panel, which has been removed from recent versions of Pandas

 • Series: It is a one-dimensional structure storing homogeneous (all data elements of the same type),
   mutable data

 • DataFrame: It is a two-dimensional structure storing heterogeneous (data elements may be of different
   data types), mutable data

 • Panel: It is a three-dimensional way of storing items (not available in recent versions of Pandas)

SERIES: A series is a one-dimensional, array-like structure with homogeneous data, i.e. all the data elements in
the series are of the same type. However, the data elements may be of any type like integer, string, float, object etc.

Ex1: 10 23 56 17 52 61 73 26

Ex2: 1.5 2.6 38.5 45.2 9.7 2.0 3.8 6.4

Ex3: App Box Car Doll ENT 1234 CBSE Mango


Jawahar Navodaya Vidyalaya, Chittoor 2

A series can also be described as an ordered dictionary with mapping of index values to data values

Index   Data      Index   Data      Index       Data
0       22        Jan     31        Sunday      1
1       -14       Feb     28        Monday      2
2       52        Mar     31        Tuesday     3
3       100       April   30        Wednesday   4
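
The dictionary-like behaviour can be sketched with the middle series from the table above. This is only an illustration; the names days, feb_days and as_dict are chosen here and do not come from the notes.

```python
import pandas as pd

# The middle series from the table above: month names mapped to days.
days = pd.Series([31, 28, 31, 30], index=["Jan", "Feb", "Mar", "April"])

# Look up a value by its index label, as with a dictionary key.
feb_days = days["Feb"]

# Convert the series to a plain dictionary of index -> data.
as_dict = days.to_dict()
print(feb_days)
print(as_dict)
```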

Some characteristics of Series are,

 • All data elements in a series are homogeneous, i.e. of the same data type
 • The size of a series is treated as immutable: methods that remove elements, such as drop( ), do not
   alter the series but return a new series of a different size
 • The values of data are mutable, i.e. the values of data elements can be changed in a series
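
These characteristics can be tried out with a short sketch. The series marks and the other names are hypothetical, chosen only for this illustration.

```python
import pandas as pd

marks = pd.Series([10, 23, 56, 17])

# Values are mutable: assigning to an existing label changes the series in place.
marks[1] = 99

# All elements share a single data type.
common_dtype = marks.dtype

# drop() leaves the original untouched and returns a shorter copy.
shorter = marks.drop(0)
print(marks.tolist(), common_dtype, len(marks), len(shorter))
```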

Creating a Series: A series can be created by using Series( ) method with various inputs like (i) List (ii) Scalar
Value or Constant (iii) Dictionary (iv) Array etc.

To use the Series( ) method to create a series, the library “pandas” has to be imported using the import
statement, like below

import pandas
(Or)
import pandas as pd

1. Creating an Empty Series: An empty series can be created by calling the Series( ) function without any
parameters.

Syntax : import pandas as pd
<Series_Object> = pd.Series( )

Ex : >>> mtsrs = pd.Series( )
>>> mtsrs
Series([], dtype: float64)
Here,
 • mtsrs is the series variable
 • Series( ) creates an empty series with the default data type
 • dtype indicates the data type of the elements of the series. The default for an empty series depends on
   the Pandas version: older versions report float64, as shown above, while Pandas 2.0 and later report object
 • pd is an alternate name given to the pandas module. Hence, instead of the module name ‘pandas’
   the short name ‘pd’ can be used
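
Because the default data type of an empty series has differed between Pandas versions, one option is to state the type explicitly. The sketch below assumes float64 is the type wanted; empty_srs is a name chosen for illustration.

```python
import pandas as pd

# Naming the dtype explicitly avoids relying on the version-dependent default.
empty_srs = pd.Series(dtype="float64")
print(empty_srs.empty, empty_srs.size, empty_srs.dtype)
```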

2. Creating a Series using List: A list can be passed as an argument to the Series( ) function to create a series.
The syntax for creating a series using a list is,

Syntax : import pandas as pd
<Series_Object> = pd.Series(data, index=idx)

Here, data can be a list, a dictionary or a scalar value

The index is the set of labels displayed alongside the values; the labels may be numbers or strings.
Providing an index is optional, and the default index is numeric and starts from 0

Ex : >>> daysinmonths=pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
>>> daysinmonths
0 31
1 28
2 31
3 30
4 31
5 30
6 31
7 31
8 30
9 31
10 30
11 31
dtype: int64

When the index is not provided, the default index starts from 0 and ranges up to len-1, where len is the
number of elements. However, an index can also be provided while creating a series, using the argument index

Ex: >>> srs_week=pd.Series(["Sun","Mon","Tue","Wed","Thu","Fri","Sat"], index=[1,2,3,4,5,6,7])


>>> srs_week
1 Sun
2 Mon
3 Tue
4 Wed
5 Thu
6 Fri
7 Sat

Index can be assigned to a series at the time of creating the series or even after creating series

Ex : >>> srs_nat=pd.Series([1,2,3,4,5])
>>> srs_nat
0 1
1 2
2 3
3 4
4 5

>>> srs_nat.index=["First","Second","Third","Fourth","Fifth"]
>>> srs_nat
First 1
Second 2
Third 3
Fourth 4
Fifth 5

If even a single value in a series is a float, the integer values are also converted to float, and hence
the series is displayed as a float series

Ex: >>> srs_test=pd.Series([2,5,8,9.4,18])


>>> srs_test
0 2.0
1 5.0
2 8.0
3 9.4
4 18.0
dtype: float64
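
The conversion can be confirmed by inspecting the dtype attribute. The following is a small sketch; the names int_srs, mixed_srs and as_float are chosen here for illustration.

```python
import pandas as pd

int_srs = pd.Series([2, 5, 8, 9, 18])      # only integers
mixed_srs = pd.Series([2, 5, 8, 9.4, 18])  # one float among the integers

print(int_srs.dtype)    # an integer dtype
print(mixed_srs.dtype)  # a float dtype

# astype() returns a new series converted to the requested type.
as_float = int_srs.astype("float64")
print(as_float.tolist())
```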

3. Creating Series by providing data with range( ) function: The sequence of values generated using
range( ) function can be used to create a Series
Ex: >>> srs_data=pd.Series(range(3,20,4))
>>> srs_data
0 3
1 7
2 11
3 15
4 19

4. Create Series from Scalar or Constant Value: A series can be created for a scalar or constant value. In
this case, it is possible to provide only one scalar value
Ex: >>> srs_const=pd.Series(18)
>>> srs_const
0 18
dtype: int64

If an index is provided, it is applied to the scalar value; if several indices are provided, every index
gets the same scalar value
Ex: >>> srs_const=pd.Series(18,['h','i','j','k'])
>>> srs_const
h 18
i 18
j 18
k 18
dtype: int64

The range( ) function can also be applied to provide indices while creating series

Ex: >>> srs_const=pd.Series(33,range(3,15,4))


>>> srs_const
3 33
7 33
11 33
dtype: int64
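
As a quick check of how the scalar is repeated, the labels and values can be pulled out as ordinary lists. This is a sketch only; labels and values are names chosen for illustration.

```python
import pandas as pd

# range(3, 15, 4) yields 3, 7, 11; each becomes an index label for the value 33.
srs_const = pd.Series(33, index=range(3, 15, 4))

labels = list(srs_const.index)
values = srs_const.tolist()
print(labels, values)
```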

5. Creating Series with Index of String (Text) Type: A string can also be specified as an index to an element
of series.

Ex: >>> srs_num=pd.Series([9,50,17,-6,0],["Odd","Even","Prime","Negative","Zero"])
>>> srs_num
Odd 9
Even 50
Prime 17
Negative -6
Zero 0
dtype: int64

6. Creating a Series with range( ) and for loop: The data and indices can be generated using the range( )
function and a for loop as well.

To generate numeric values, either for data or for indices, the range( ) function alone can be used, without
a for statement.

Ex: >>> srs1=pd.Series(range(11,20,2),range(1,10,2))


>>> srs1
1 11
3 13
5 15
7 17
9 19
dtype: int64

But, to generate characters as data or index, a for loop over a string (written here as a list comprehension)
can be used, as follows

Ex1: >>> srs2=pd.Series([11,22,33,44,55],index=[i for i in 'apple'])


>>> srs2
a 11
p 22
p 33
l 44
e 55
dtype: int64

Ex2: >>> srs3=pd.Series([ch for ch in "Navodaya"],index=[i for i in 'udaigiri'])


>>> srs3
u N
d a
a v
i o
g d
i a
r y
i a
dtype: object
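
A shorter spelling of the same idea is sketched below: list( ) applied to a string gives the same list of characters as the for loop. The sketch also shows what happens when a label is repeated, as 'p' is in 'apple'. The name p_values is chosen for illustration.

```python
import pandas as pd

# list("apple") gives the same result as [i for i in "apple"].
srs2 = pd.Series([11, 22, 33, 44, 55], index=list("apple"))

# Index labels need not be unique: "p" occurs twice, so selecting it
# returns a two-element series rather than a single value.
p_values = srs2["p"]
print(p_values.tolist())
```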

7. Creating a Series using two different lists: A series can be created by providing data as one list and the
indices as the other list

Ex: >>> srs_num=pd.Series(["One","Two","Three","Four","Five"], index=[1,2,3,4,5])


>>> srs_num
1 One
2 Two
3 Three
4 Four
5 Five
dtype: object

8. Creating a Series by using NaN for missing values: A series having missing numbers can be created. For
this purpose the constant NaN (Not a Number) of the NumPy library can be used in place of each missing
number. After the statement import numpy as np, the constant is written as np.nan. (The older spelling
np.NaN was removed in NumPy 2.0.)

Ex: >>> import numpy as np
>>> srs_sales = pd.Series([536, 486, np.nan, 472, 86, np.nan, 145], index = ["Sun", "Mon",
"Tue", "Wed", "Thu", "Fri", "Sat"])
>>> srs_sales
Sun 536.0
Mon 486.0
Tue NaN
Wed 472.0
Thu 86.0
Fri NaN
Sat 145.0
dtype: float64
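
Once a series contains NaN, the missing entries can be found and filled, which is one of the Pandas features listed earlier. The sketch below uses the isna( ) and fillna( ) methods; replacing the gaps with 0 is an assumption made only for this illustration.

```python
import numpy as np
import pandas as pd

srs_sales = pd.Series(
    [536, 486, np.nan, 472, 86, np.nan, 145],
    index=["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
)

# isna() marks the missing entries; summing the booleans counts them.
missing_count = int(srs_sales.isna().sum())

# fillna() returns a new series with the gaps replaced (here by 0).
filled = srs_sales.fillna(0)
print(missing_count, filled["Tue"])
```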

9. Creating a Series from Dictionary: A series can also be created using a dictionary. A dictionary is a
collection of elements, where each element is a combination of a key and a value. The keys of the dictionary
become the index of the series, so a separate index need not be given while creating it. As the keys of a
dictionary are unique, a key that is repeated keeps only its last value; this is why the series in Ex2 below
has only two elements.

Ex: >>> srs_month1=pd.Series({"Jan":31, "Feb":28, "Mar":31, "Apr":30, "May":31, "June":30})


>>> srs_month1
Jan 31
Feb 28
Mar 31
Apr 30
May 31
June 30
dtype: int64

Ex2: >>> srs_month2=pd.Series({31:"July", 31:"Aug", 30:"Sep", 31:"Oct", 30:"Nov", 31:"Dec"})


>>> srs_month2
31 Dec
30 Nov
dtype: object
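
The two-element result of Ex2 comes from the dictionary itself, not from Pandas, as the sketch below suggests. The name srs_fixed and the idea of swapping keys and values are illustrations added here, not part of the original example.

```python
import pandas as pd

# Repeated keys are allowed in a dictionary literal, but only the last
# value given for each key survives.
months = {31: "July", 31: "Aug", 30: "Sep", 31: "Oct", 30: "Nov", 31: "Dec"}
print(months)   # only two keys remain

srs_month2 = pd.Series(months)

# Using the month names as keys keeps all six entries, because the names are unique.
srs_fixed = pd.Series({"July": 31, "Aug": 31, "Sep": 30, "Oct": 31, "Nov": 30, "Dec": 31})
print(len(srs_month2), len(srs_fixed))
```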

10. Creating a Series using Mathematics Expression / Function: The data values or index values for a series
object can also be provided, from a result of expression or function.

Ex1: >>> srs1=pd.Series(data=[11,22,33,44],index=[1,1+1,2+1,1+3])


>>> srs1
1 11
2 22
3 33
4 44
dtype: int64

Ex2: >>> d=np.arange(10,100,20)


>>> i=d//10
>>> s1=pd.Series(d,i)
>>> s1
1 10
3 30
5 50
7 70
9 90
dtype: int32

Ex3: >>> idx=np.arange(10,15)


>>> srs=pd.Series(index=idx,data=idx**2)
>>> srs
10 100
11 121
12 144
13 169
14 196
dtype: int32
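
The int32 in the last two outputs most likely reflects the platform on which the examples were run (NumPy's default integer was 32-bit on Windows before NumPy 2.0); on other systems the same code typically reports int64. If a fixed type is wanted, it can be requested explicitly, as in this sketch. The name srs64 is chosen for illustration.

```python
import numpy as np
import pandas as pd

idx = np.arange(10, 15)               # 10, 11, 12, 13, 14
srs = pd.Series(idx ** 2, index=idx)  # dtype follows NumPy's default integer

# Requesting the dtype explicitly gives the same type on every platform.
srs64 = pd.Series(idx ** 2, index=idx, dtype="int64")
print(srs64.dtype)
```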

Accessing Data from a Series:

1. A series can be accessed by using the name of the series


Ex: >>> srs_prime=pd.Series([2,3,5,7,11,13,17,19])
>>> srs_prime
0 2
1 3
2 5
3 7
4 11
5 13
6 17
7 19

2. Individual element(s) of a series can be accessed using position / index.


Ex: >>> srs_prime=pd.Series([2,3,5,7,11,13,17,19])

>>> srs_prime[3]
7

>>> srs_prime[[2,4,7]]
2 5
4 11
7 19

3. A sequence of elements of a series can be accessed by applying slicing on the series


Ex: >>> srs_odd=pd.Series([1,3,5,7,9,11,13,15,17,19], index=['a','b','c','d','e','f','g','h','i','j'])

>>> srs_odd[:3]
a 1
b 3
c 5
dtype: int64

>>> srs_odd[2:8]
c 5
d 7
e 9
f 11
g 13
h 15
dtype: int64

>>> srs_odd[4:10:3]
e 9
h 15
dtype: int64

>>> srs_odd[-3:]
h 15
i 17
j 19
dtype: int64

4. Elements of a series can also be accessed by using iloc and loc

 • iloc: It is used for indexing or slicing based on position, i.e., by row number (and, in a dataframe,
   by column number as well). It refers to position-based indexing. In a slice, the end position is
   excluded. The syntax for using iloc with a series is,

Syntax: <Series_Object>.iloc[<start position> : <end position>]

 • loc: It is used for indexing or selecting based on name, i.e., by row name (and, in a dataframe, by
   column name as well). It refers to name-based indexing. In a slice, the end name is included. The
   syntax for using loc with a series is,

Syntax: <Series_Object>.loc[<start name> : <end name>]



Ex:
>>> weeksrs = pd.Series(index = ["S", "M", "T", "W", "Th", "F", "Sa"],
                        data = ["Sunday", "Monday", "Tuesday", "Wednesday",
                                "Thursday", "Friday", "Saturday"])
>>> weeksrs
S Sunday
M Monday
T Tuesday
W Wednesday
Th Thursday
F Friday
Sa Saturday
dtype: object

>>> weeksrs.iloc[2 : 5]
T Tuesday
W Wednesday
Th Thursday
dtype: object

>>> weeksrs.loc["M" : "F"]


M Monday
T Tuesday
W Wednesday
Th Thursday
F Friday
dtype: object
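
The difference between the two kinds of slice can be sketched as follows: an iloc slice stops before its end position, while a loc slice includes its end label. The second part uses a small series with an integer index starting at 1 to show that a label and a position are not the same thing. All variable names other than weeksrs are chosen for illustration.

```python
import pandas as pd

weeksrs = pd.Series(
    ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"],
    index=["S", "M", "T", "W", "Th", "F", "Sa"],
)

by_position = weeksrs.iloc[1:3]   # positions 1 and 2; position 3 is excluded
by_label = weeksrs.loc["M":"T"]   # labels "M" to "T"; the end label is included

# With an integer index that does not start at 0, label and position differ.
srs_week = pd.Series(["Sun", "Mon", "Tue"], index=[1, 2, 3])
first_by_label = srs_week.loc[1]        # the element whose label is 1
second_by_position = srs_week.iloc[1]   # the element at position 1
print(by_position.tolist(), by_label.tolist(), first_by_label, second_by_position)
```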

Naming a Series: To name the values and index of a series, the name property can be used. The name assigned
to the index will be displayed above the index and the name assigned to values will be displayed at the bottom of
the series

Ex: >>> srs=pd.Series(["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],index=[1, 2, 3, 4, 5, 6, 7])
>>> srs
1 Sun
2 Mon
3 Tue
4 Wed
5 Thu
6 Fri
7 Sat
dtype: object

>>> srs.name="Day"
>>> srs.index.name="S.No."

>>> srs
S.No.
1 Sun
2 Mon
3 Tue
4 Wed
5 Thu
6 Fri
7 Sat
Name: Day, dtype: object
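
The names can also be given at the time of creation instead of being assigned afterwards. The sketch below shows both routes; idx and srs_named are names chosen for illustration.

```python
import pandas as pd

# Route 1: assign the names after creating the series.
srs = pd.Series(["Sun", "Mon", "Tue"], index=[1, 2, 3])
srs.name = "Day"
srs.index.name = "S.No."

# Route 2: supply the same names while creating the index and the series.
idx = pd.Index([1, 2, 3], name="S.No.")
srs_named = pd.Series(["Sun", "Mon", "Tue"], index=idx, name="Day")
print(srs_named)
```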

Series Object Attributes: The various properties of a series can be accessed by using its attributes. The syntax
for accessing an attribute with Series Object is,

<Series_Object>.<Attribute_Name>

Some common attributes related to series object are as follows,

Attribute        Description
Series.index     Returns the index of the series
Series.values    Returns an ndarray holding the values of the series
Series.dtype     Returns the data type of the data in the series
Series.shape     Returns the shape of the data in the form of a tuple
Series.nbytes    Returns the number of bytes occupied by the series data
Series.ndim      Returns the number of dimensions
Series.size      Returns the number of elements
Series.hasnans   Returns True, if any NaN values are present
Series.empty     Returns True, if the series object is empty

Ex: >>> sales = pd.Series([536, 486, np.nan, 472, 86, np.nan, 145], index = ["Sun", "Mon", "Tue",
"Wed", "Thu", "Fri", "Sat"])

>>> sales.index
Index(['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'], dtype='object')

>>> sales.values
array([536., 486., nan, 472., 86., nan, 145.])

>>> sales.dtype
dtype('float64')

>>> sales.shape
(7,)

>>> sales.nbytes
56

>>> sales.ndim
1

>>> sales.size
7

>>> sales.hasnans
True

>>> sales.empty
False
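
The value 56 reported by nbytes can be worked out by hand: each float64 value takes 8 bytes and the series holds 7 values. The sketch below repeats that arithmetic; bytes_per_value and total_bytes are names chosen for illustration.

```python
import numpy as np
import pandas as pd

sales = pd.Series([536, 486, np.nan, 472, 86, np.nan, 145])

# Each float64 value occupies 8 bytes, so 7 values take 7 * 8 = 56 bytes.
bytes_per_value = sales.values.itemsize
total_bytes = sales.size * bytes_per_value
print(bytes_per_value, total_bytes, sales.nbytes)
```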

Retrieving Values from a Series using head( ) and tail( ) functions:

 • The head( ) function, when invoked on a series object, returns the specified number of rows from the
   top. By default, this function fetches 5 rows

Ex: >>> srs=pd.Series(data=range(1,100,10),index=range(0,10))


>>> srs
0 1
1 11
2 21
3 31
4 41
5 51
6 61
7 71
8 81
9 91
dtype: int64

>>> srs.head( )
0 1
1 11
2 21
3 31
4 41
dtype: int64

>>> srs.head(3)
0 1
1 11
2 21
dtype: int64

 • The tail( ) function, when invoked on a series object, returns the specified number of rows from the
   bottom. By default, this function fetches 5 rows from the bottom
Ex: >>> srs=pd.Series(data=range(1,100,10),index=range(0,10))
>>> srs
0 1
1 11
2 21
3 31
4 41
5 51
6 61
7 71
8 81
9 91
dtype: int64

>>> srs.tail( )
5 51
6 61
7 71
8 81
9 91
dtype: int64

>>> srs.tail(7)
3 31
4 41
5 51
6 61
7 71
8 81
9 91
dtype: int64
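
A compact sketch of both functions follows, including the case where more rows are requested than the series contains. The names top_two, bottom_two and everything are chosen for illustration.

```python
import pandas as pd

srs = pd.Series(data=range(1, 100, 10), index=range(0, 10))

top_two = srs.head(2)       # first two rows
bottom_two = srs.tail(2)    # last two rows

# Asking for more rows than the series holds simply returns all of them.
everything = srs.head(20)
print(top_two.tolist(), bottom_two.tolist(), len(everything))
```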

Mathematical Operations on Series: It is possible to perform mathematical / arithmetic operations, such as
addition (+), subtraction (-), multiplication (*), division (/) etc. on series.

Arithmetic operations are carried out between elements that have the same index label. To get a value
for every element, the indices of the two series must therefore be the same; any label that is present in only
one of the series produces NaN in the result.

>>> srs1 >>> srs2 >>> srs3


1 11 1 21 7 31
2 12 2 22 8 32
3 13 3 23 9 33
4 14 4 24 10 34
dtype: int64 dtype: int64 dtype: int64

Now,
>>> srs1+srs2 >>> srs2-srs1 >>> srs1*srs2 >>> srs2/srs1
1 32 1 10 1 231 1 1.909091
2 34 2 10 2 264 2 1.833333
3 36 3 10 3 299 3 1.769231
4 38 4 10 4 336 4 1.714286
dtype: int64 dtype: int64 dtype: int64 dtype: float64

But,
>>> srs1+srs3 >>> srs3-srs2 >>> srs3*srs1 >>> srs2/srs3
1 NaN 1 NaN 1 NaN 1 NaN
2 NaN 2 NaN 2 NaN 2 NaN
3 NaN 3 NaN 3 NaN 3 NaN
4 NaN 4 NaN 4 NaN 4 NaN
7 NaN 7 NaN 7 NaN 7 NaN
8 NaN 8 NaN 8 NaN 8 NaN
9 NaN 9 NaN 9 NaN 9 NaN
10 NaN 10 NaN 10 NaN 10 NaN
dtype: float64 dtype: float64 dtype: float64 dtype: float64
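
When NaN is not the desired result for unmatched labels, the method form of the operator can be used with a fill value. The sketch below uses Series.add( ) with fill_value=0; treating a missing element as 0 is an assumption made only for this illustration.

```python
import pandas as pd

srs1 = pd.Series([11, 12, 13, 14], index=[1, 2, 3, 4])
srs3 = pd.Series([31, 32, 33, 34], index=[7, 8, 9, 10])

# The + operator gives NaN wherever a label is missing from either series.
plain_sum = srs1 + srs3

# add() with fill_value=0 treats a missing element as 0 instead.
filled_sum = srs1.add(srs3, fill_value=0)
print(filled_sum)
```

The result is still a float series, because the alignment step introduces missing values before they are filled.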

Vector Operations on Series: It is possible to perform vector operations on series, i.e. arithmetic operations
such as addition (+), subtraction (-), multiplication (*), division (/) etc. can be performed between a series and
a scalar value (constant). The operation is applied to every element of the series

Ex:
>>> srs
1 11
2 12
3 13
4 14
dtype: int64

>>> srs+15 >>> 10-srs >>> srs*0.75
1 26 1 -1 1 8.25
2 27 2 -2 2 9.00
3 28 3 -3 3 9.75
4 29 4 -4 4 10.50
dtype: int64 dtype: int64 dtype: float64

>>> 25/srs >>> srs**3 >>> srs>12.5


1 2.272727 1 1331 1 False
2 2.083333 2 1728 2 False
3 1.923077 3 2197 3 True
4 1.785714 4 2744 4 True
dtype: float64 dtype: int64 dtype: bool

Retrieving Values using Conditions: While displaying the series, condition can be applied using relational
operators, like below

Ex: >>> numsrs=pd.Series([1, 2, 3, 4, 5, 6], [11, 22, 33, 44, 55, 66])

>>> numsrs[numsrs<3]
11 1
22 2
dtype: int64

>>> numsrs[numsrs>=4]
44 4
55 5
66 6
dtype: int64
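
Two or more conditions can be combined as well. The sketch below uses & for "and" and | for "or", which are the operators Pandas applies element by element; the names middle and ends are chosen for illustration.

```python
import pandas as pd

numsrs = pd.Series([1, 2, 3, 4, 5, 6], [11, 22, 33, 44, 55, 66])

# Parentheses are required around each condition.
# & means "and", | means "or".
middle = numsrs[(numsrs > 2) & (numsrs < 6)]
ends = numsrs[(numsrs < 2) | (numsrs > 5)]
print(middle.tolist(), ends.tolist())
```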

Deleting Elements from a Series: An element of a series can be deleted by passing the index of that element
to the drop( ) method. This method does not change the original Series Object; it creates another Series
Object without that element and returns it. To keep the result, it has to be assigned to a variable.

The syntax of using drop( ) method is as follows

Syntax: <Series_Object>.drop(Index_of_Element)



Ex: >>> primesrs=pd.Series([2, 3, 5, 7, 9, 11, 13])

>>> primesrs
0 2
1 3
2 5
3 7
4 9
5 11
6 13
dtype: int64

>>> primesrs.drop(4)
0 2
1 3
2 5
3 7
5 11
6 13
dtype: int64

>>> primesrs
0 2
1 3
2 5
3 7
4 9
5 11
6 13
dtype: int64
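
Since the original series is left unchanged, the usual way to keep the deletion is to assign the result back to a name. The sketch below also passes a list of indices to drop several elements at once; without_first_two is a name chosen for illustration.

```python
import pandas as pd

primesrs = pd.Series([2, 3, 5, 7, 9, 11, 13])

# Assign the result back to keep the change.
primesrs = primesrs.drop(4)

# A list of index labels removes several elements at once.
without_first_two = primesrs.drop([0, 1])
print(primesrs.tolist(), without_first_two.tolist())
```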

Sorting Series Values: The sort_values( ) function returns the Series Object sorted in ascending order of its
data items. It does not change the original Series Object.

The syntax of using sort_values( ) method is as follows

Syntax: <Series_Object>.sort_values( )

Ex: >>> srs=pd.Series([18,25,13,90,35])


>>> srs
0 18
1 25
2 13
3 90
4 35
dtype: int64

>>> srs.sort_values()
2 13
0 18
1 25
4 35
3 90
dtype: int64

>>> srs
0 18
1 25
2 13
3 90
4 35
dtype: int64
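
Two common variations are sketched below: sorting in descending order with the ascending parameter, and sorting by index with sort_index( ). The names descending, srs_sorted and restored are chosen for illustration.

```python
import pandas as pd

srs = pd.Series([18, 25, 13, 90, 35])

descending = srs.sort_values(ascending=False)   # largest value first

# Assign the sorted result to a name to keep it.
srs_sorted = srs.sort_values()

# sort_index() orders by index label, which restores the original order here.
restored = srs_sorted.sort_index()
print(descending.tolist(), srs_sorted.tolist(), restored.tolist())
```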
