
Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with 'relational' or 'labeled' data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real world data analysis in Python.

List of Pandas Exercises:

pandas is well suited for many different kinds of data:

• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data
• Arbitrary matrix data with row and column labels
• Any other form of observational / statistical data sets

To install pandas from a terminal:

pip install pandas

Alternatively, if you're currently viewing this article in a Jupyter notebook, you
can run this cell:

!pip install pandas

The ! at the beginning makes the notebook run the rest of the line as a shell
command, as if it were typed in a terminal.

To import pandas, we usually give it a shorter name since it's used so much:

import pandas as pd

Now to the basic components of pandas.

Core components of pandas: Series and DataFrames


The two primary components of pandas are the Series and the DataFrame.
A Series is essentially a column, and a DataFrame is a two-dimensional table
made up of a collection of Series.

DataFrames and Series are quite similar in that many operations that you can do
with one you can do with the other, such as filling in null values and calculating
the mean.
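
As a small illustration of that similarity, the sketch below calls fillna() and mean() on both a Series and a DataFrame. The data and the names scores and table are made up for this example.

```python
import numpy as np
import pandas as pd

# Illustrative data; `scores` and `table` are hypothetical names.
scores = pd.Series([10.0, np.nan, 30.0])
table = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, np.nan]})

# mean() skips missing values by default on both structures.
series_mean = scores.mean()    # a single number
column_means = table.mean()    # one number per column, returned as a Series

# fillna() returns a new object with the missing values replaced.
scores_filled = scores.fillna(0)
table_filled = table.fillna(0)

print(series_mean)
print(column_means)
print(scores_filled)
print(table_filled)
```

The same method names work on both objects; the DataFrame versions simply apply column by column.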

Import the following libraries to start:

import pandas as pd
import numpy as np

Pandas version:

import pandas as pd
print(pd.__version__)

Key: in the examples, df denotes a pandas DataFrame object and s a pandas Series object.

Create a Series:

import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])

print(s)

Sample Output:

0 2
1 4
2 6
3 8
4 10
dtype: int64

Create Dataframe:

import pandas as pd

df = pd.DataFrame({'X': [78, 85, 96, 80, 86],
                   'Y': [84, 94, 89, 83, 86],
                   'Z': [86, 97, 96, 72, 83]})

print(df)

Sample Output:

X Y Z
0 78 84 86
1 85 94 97
2 96 89 96
3 80 83 72
4 86 86 83

Create a Series in python – pandas


A Series is a one-dimensional labeled array capable
of holding data of any type (integers, strings, floats,
Python objects, etc.). There are several ways to
create a Series in pandas: an empty Series, a Series
from an array without an index, from an array with
an index, from a dictionary, and from a scalar
value. The axis labels are collectively called the index.

Create an Empty Series:


The most basic Series that can be created is an empty
Series. The example below creates one. The dtype is
given explicitly because the default dtype of an empty
Series differs between pandas versions.

# Example: create an empty Series
import pandas as pd
s = pd.Series(dtype='float64')
print(s)

output:

Series([], dtype: float64)

Create a series from array without index:


Let's see an example of how to create a Series from
an array.

# Example: create a Series from an array
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d','e','f'])
s = pd.Series(data)
print(s)
output:
0  a
1 b
2 c
3 d
4 e
5 f
dtype: object

Create a series from array with index:

This example shows how to create a Series with an
explicit index. An index starting from 1000 is
used in the example below.

# Example: create a Series from an array with a specified index
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d','e','f'])
s = pd.Series(data, index=[1000, 1001, 1002, 1003, 1004, 1005])
print(s)
output:
1000   a
1001   b
1002   c
1003   d
1004   e
1005   f
dtype: object

Create a series from Dictionary


This example shows how to create a Series from a
dictionary. The dictionary keys are used to
construct the index.

# Example: create a Series from a dictionary
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data, index=['b','c','d','a'])
print(s)

The order of the index passed in is maintained, and the
missing element 'd' is filled with NaN (Not a Number).

output:
b   1.0
c   2.0
d   NaN
a   0.0
dtype: float64

Create a series from Scalar value


This example shows how to create a Series from a
scalar value. If data is a scalar value, pass an index
to set the length: the value is repeated once per
index label. Without an index, the result is a
one-element Series.

# Example: create a Series from a scalar
import pandas as pd
import numpy as np
s = pd.Series(7, index=[0, 1, 2, 3])
print(s)
output:
0 7
1 7
2 7
3 7
dtype: int64

How to Access the elements of a Series in python – pandas

Elements can be accessed in two ways:

• Accessing data from a Series by position
• Retrieving data using a label (index)

Accessing data from series with position:

Accessing or retrieving the first element:

Counting starts from zero, which means the first
element is stored at position zero, and so on.

# create a series
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d','e','f'])
s = pd.Series(data)

# retrieve the first element
print(s[0])
output:

a

Access or Retrieve the first three elements in the Series:

# create a series
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d','e','f'])
s = pd.Series(data)

# retrieve first three elements
print(s[:3])
output:

0 a
1 b
2 c
dtype: object

Access or Retrieve the last three elements in the Series:

# create a series
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d','e','f'])
s = pd.Series(data)

# retrieve last three elements
print(s[-3:])
output:

3 d
4 e
5 f
dtype: object

Accessing data from series with Labels or index:


A Series is like a fixed-size dictionary in that you
can get and set values by index label.

Retrieve a single element using index label:

# create a series
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d','e','f'])
s = pd.Series(data, index=[100, 101, 102, 103, 104, 105])

# retrieve a single element by its index label
print(s[102])

output:
c

Retrieve multiple elements using index labels:


# create a series
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d','e','f'])
s = pd.Series(data, index=[100, 101, 102, 103, 104, 105])

# retrieve multiple elements with a list of index labels
print(s[[102, 103, 104]])
output:
102 c
103 d
104 e
dtype: object

Note: If a requested label is not present in the index, a KeyError
exception is raised.
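
The note above can be checked with a short sketch. It reuses the Series from the previous example and shows two ways of avoiding the exception: Series.get with a default value, and a membership test on the index. The label 999 is an arbitrary missing label chosen for illustration.

```python
import numpy as np
import pandas as pd

data = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
s = pd.Series(data, index=[100, 101, 102, 103, 104, 105])

# Asking for a label that is not in the index raises KeyError.
try:
    s[999]
    missing_raised = False
except KeyError:
    missing_raised = True

# Series.get returns a default value instead of raising.
fallback = s.get(999, default='not found')

# The `in` operator tests the index labels, not the values.
has_label = 102 in s

print(missing_raised, fallback, has_label)
```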

http://www.datasciencemadesimple.com/access-elements-series-python-pandas/

Python Pandas - Series

Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index.

pandas.Series
A pandas Series can be created using the following constructor −
pandas.Series(data, index, dtype, copy)
The parameters of the constructor are as follows −

Sr.No  Parameter  Description
1      data       Data takes various forms such as an ndarray, a list, or a constant.
2      index      Index values must be hashable and the same length as data. Defaults to np.arange(n) if no index is passed.
3      dtype      The data type. If None, the data type is inferred.
4      copy       Whether to copy the input data. Default False.

A series can be created using various inputs like −

• Array
• Dict
• Scalar value or constant

Create an Empty Series


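The most basic Series that can be created is an empty Series.

Example

# import the pandas library, aliased as pd
import pandas as pd
# dtype is given explicitly; the default for an empty Series varies by pandas version
s = pd.Series(dtype='float64')
print(s)

Its output is as follows −
Series([], dtype: float64)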

Create a Series from ndarray


If data is an ndarray, then the index passed must be of the same length. If no index is
passed, the default index is range(n), where n is the array length, i.e.,
[0, 1, 2, ..., len(array)-1].

Example 1

# import the pandas library, aliased as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

Its output is as follows −
0 a
1 b
2 c
3 d
dtype: object
We did not pass any index, so by default, it assigned the indexes ranging from 0
to len(data)-1, i.e., 0 to 3.

Example 2

# import the pandas library, aliased as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

Its output is as follows −
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized indexed values in the
output.

Create a Series from dict


A dict can be passed as input. If no index is specified, the dictionary keys are used,
in insertion order, to construct the index. If an index is passed, the values in data
corresponding to the labels in the index are pulled out.

Example 1

# import the pandas library, aliased as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

Its output is as follows −
a 0.0
b 1.0
c 2.0
dtype: float64
Observe − Dictionary keys are used to construct index.

Example 2

# import the pandas library, aliased as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)

Its output is as follows −
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Observe − Index order is persisted and the missing element is filled with NaN (Not a
Number).

Create a Series from Scalar


If data is a scalar value, pass an index to set the length. The value is repeated to match
the length of the index.

# import the pandas library, aliased as pd
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

Its output is as follows −
0 5
1 5
2 5
3 5
dtype: int64

Accessing Data from Series with Position


Data in the series can be accessed similar to that in an ndarray.

Example 1

Retrieve the first element. Counting starts from zero, which means the first element is
stored at position zero, and so on. The iloc indexer selects by position; plain s[0] on a
Series with string labels relies on a positional fallback that recent pandas versions
deprecate.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first element by position
print(s.iloc[0])

Its output is as follows −
1

Example 2
Retrieve the first three elements in the Series. If a : is placed in front of an index, all
items before that index are extracted; if it is placed after an index, all items from that
index onwards are extracted. If two indexes are given with : between them, the items
between the two indexes are extracted (not including the stop index).

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first three elements
print(s[:3])

Its output is as follows −
a 1
b 2
c 3
dtype: int64

Example 3

Retrieve the last three elements.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the last three elements
print(s[-3:])

Its output is as follows −
c 3
d 4
e 5
dtype: int64

Retrieve Data Using Label (Index)


A Series is like a fixed-size dict in that you can get and set values by index label.

Example 1

Retrieve a single element using index label value.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve a single element
print(s['a'])

Its output is as follows −
1

Example 2

Retrieve multiple elements using a list of index label values.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve multiple elements
print(s[['a','c','d']])

Its output is as follows −
a 1
c 3
d 4
dtype: int64

Example 3

If a label is not contained in the index, an exception is raised.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# try to retrieve a label that is not in the index
print(s['f'])

Its output is as follows −

KeyError: 'f'
Python Programs

# 1.Creating series from list


import pandas as pd
import numpy as np
S1=pd.Series([101,102,103,104,105])
print(S1)

>>>

0 101
1 102
2 103
3 104
4 105
dtype: int64

# 2.Assigning index to elements of Series


S1=pd.Series([101,102,103,104,105],index=['A1','B1','C1','D1','E1'])

print(S1)

>>>
A1 101
B1 102
C1 103
D1 104
E1 105
dtype: int64

#3.Create series using range() function


S2=pd.Series(range(10,21))

print(S2)

>>>

0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
dtype: int64

#4.Create series using range() function and changing data type

S2=pd.Series(range(10),dtype='float32')
print(S2)

>>>
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
dtype: float32

#5.Printing Series elements and Series indexes

S3=pd.Series([20,np.nan,np.nan,45,67,89,54,45,23],
             index=['Anil','BN','BM','Ankit','Ram','Vishal','Ankita','Lokesh','Venkat'])

print(S3)
print(S3.index)
print(S3.values)
print(S3.dtype)
print(S3.shape)
print(S3.nbytes)
print(S3.ndim)
print(S3.dtype.itemsize)  # Series.itemsize is not available in current pandas
print(S3.size)
print(S3.hasnans)

>>>
Anil 20.0
BN NaN
BM NaN
Ankit 45.0
Ram 67.0
Vishal 89.0
Ankita 54.0
Lokesh 45.0
Venkat 23.0
dtype: float64

Index(['Anil', 'BN', 'BM', 'Ankit', 'Ram', 'Vishal', 'Ankita', 'Lokesh',
       'Venkat'],
      dtype='object')
[20. nan nan 45. 67. 89. 54. 45. 23.]
float64
(9,)
72
1
8
9
True

#6.Accessing elements of Series

print(S3)

Anil 20.0
BN NaN
BM NaN
Ankit 45.0
Ram 67.0
Vishal 89.0
Ankita 54.0
Lokesh 45.0
Venkat 23.0
dtype: float64

print(S3.iloc[6])   # position 6 is the seventh element

>>>
54.0

print(S3[:2])

>>>
Anil 20.0
BN NaN
dtype: float64

print(S3[1:4])

>>>
BN NaN
BM NaN
Ankit 45.0
dtype: float64

#7.Series with two different Lists

dayno=[1,2,3,4,5,6,7]
dayname=["Monday","Tuesday","Wednesday","Thursday","Friday",
"Saturday","Sunday"]

ser_week=pd.Series(dayname,index=dayno)
print(ser_week)

>>>
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
dtype: object

#8.Creating series with integer, NaN and float values
#Look at the change of data type of Series

S1=pd.Series([101,102,103,104,np.nan,90.7])
print(S1)

>>>

0 101.0
1 102.0
2 103.0
3 104.0
4 NaN
5 90.7
dtype: float64

#9. Creating Series from dictionary


# Keys become the index labels and values become the data
# Check the change in data type

D1={'1':'Monday','2':'Tuesday','3':'Wednesday','4':'Thursday',
'5':'Friday','6':'Saturday','7':'Sunday'}
print(D1)
S5=pd.Series(D1)
print(S5)

>>>
{'1': 'Monday', '2': 'Tuesday', '3': 'Wednesday', '4': 'Thursday',
'5': 'Friday', '6': 'Saturday', '7': 'Sunday'}

1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
dtype: object
#10.Creating Series using a scalar/constant value

S9=pd.Series(90.7,index=['a','b','c','d','e','f','g'])
print(S9)

>>>
a 90.7
b 90.7
c 90.7
d 90.7
e 90.7
f 90.7
g 90.7
dtype: float64

S7=pd.Series(90)
print(S7)

>>>
0 90
dtype: int64
S8=pd.Series(90,index=[1])
print(S8)

>>>
1 90
dtype: int64

#11.Specifying range() function in the index argument to generate
# a series object with a constant/scalar value

S90=pd.Series(95,index=range(5))
print(S90)

>>>
0 95
1 95
2 95
3 95
4 95
dtype: int64

#12. iloc indexer (selection by integer position)


S8=pd.Series([1,2,3,4,5,6,7],index=['a','b','c','d','e','f','g'])

print(S8.iloc[1:5])

>>>
b 2
c 3
d 4
e 5
dtype: int64

#13. loc indexer (selection by label)


print(S8.loc['b':'e'])

>>>
b 2
c 3
d 4
e 5
dtype: int64
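
The two slices above return the same four elements even though their end points look different. A minimal sketch of why: a position slice excludes its stop position, while a label slice includes its stop label.

```python
import pandas as pd

S8 = pd.Series([1, 2, 3, 4, 5, 6, 7], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

by_position = S8.iloc[1:5]   # positions 1 to 4; stop position 5 is excluded
by_label = S8.loc['b':'e']   # labels 'b' to 'e'; stop label 'e' is included

print(by_position.equals(by_label))

# 'b' sits at position 1, yet the two slices below differ in length.
first_by_position = S8.iloc[:1]   # only 'a'
first_by_label = S8.loc[:'b']     # 'a' and 'b'
print(len(first_by_position), len(first_by_label))
```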

#14.Extract those values of series for specified index positions - take() Method

dayno=[91,92,93,94,95,96,97]
dayname=["Monday","Tuesday","Wednesday","Thursday","Friday",
"Saturday","Sunday"]

ser_week=pd.Series(dayname,index=dayno)

print(ser_week)
>>>

91 Monday
92 Tuesday
93 Wednesday
94 Thursday
95 Friday
96 Saturday
97 Sunday
dtype: object

pos=[0,2,5]
print(ser_week.take(pos))

>>>
91 Monday
93 Wednesday
96 Saturday
dtype: object

print(ser_week[91])

>>>
Monday

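#15.Stack 2 Series vertically (one after the other)
# Series.append() was removed in pandas 2.0; pd.concat() does the same job

ss1=pd.Series([1,2,3,4,5],index=[11,12,13,14,15])

ss2=pd.Series(['a','b','c','d','e'])

print(pd.concat([ss1,ss2]))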

>>>
11 1
12 2
13 3
14 4
15 5
0 a
1 b
2 c
3 d
4 e
dtype: object
#The index labels of each Series are carried over unchanged

print(ss1)
>>>
11 1
12 2
13 3
14 4
15 5
dtype: int64

print(ss2)
>>>
0 a
1 b
2 c
3 d
4 e
dtype: object

ss3=pd.concat([ss1,ss2])

print(ss3)

11 1
12 2
13 3
14 4
15 5
0 a
1 b
2 c
3 d
4 e
dtype: object

#The index labels of each Series are carried over unchanged
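
If a fresh 0..n-1 index is wanted instead of the labels carried over from each Series, one option is pd.concat with ignore_index=True. This is a sketch using the same ss1 and ss2 as above; the name ss4 is made up for the example.

```python
import pandas as pd

ss1 = pd.Series([1, 2, 3, 4, 5], index=[11, 12, 13, 14, 15])
ss2 = pd.Series(['a', 'b', 'c', 'd', 'e'])

# ignore_index=True discards the original labels and numbers the result 0..n-1.
ss4 = pd.concat([ss1, ss2], ignore_index=True)
print(ss4)
```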

head() and tail() methods

# 16 head() method (get the first N rows):

head() with no arguments returns the first five elements of the Series.

# 17 tail() method (get the last N rows):

tail() with no arguments returns the last five elements of the Series.

import pandas as pd

S8=pd.Series([1,2,3,4,5,6,7],index=['a','b','c','d','e','f','g'])
print("The Series is")
print(S8)

The Series is
a 1
b 2
c 3
d 4
e 5
f 6
g 7
dtype: int64

print("Head function output")


print(S8.head())

Head function output


a 1
b 2
c 3
d 4
e 5
dtype: int64

print("Tail function output")

print(S8.tail())
Tail function output
c 3
d 4
e 5
f 6
g 7
dtype: int64

print(S8.head(7))
a 1
b 2
c 3
d 4
e 5
f 6
g 7
dtype: int64

print(S8.tail(6))
b 2
c 3
d 4
e 5
f 6
g 7
dtype: int64

print(S8.head(-4))   # negative n: all elements except the last 4
a 1
b 2
c 3
dtype: int64

#Creating a series using a mathematical expression/function

# Syntax:
# import pandas as pd
# <series Object>=pd.Series(index=None,data=<expression [function]>)

#To generate a series using a mathematical function ( exponentiation )

import pandas as pd
import numpy as np
s1=np.arange(10,15)
print(s1)

[10 11 12 13 14]

sobj=pd.Series(index=s1,data=s1**2)
print(sobj)

10 100
11 121
12 144
13 169
14 196
dtype: int32

#Mathematical Operations on Series

#All the arithmetic operators such as +, -, *, / etc. can be applied to Series.
#Values are matched by index label, not by position.

import pandas as pd
s1=pd.Series([11,12,13,14],index=[1,2,3,4])
print("Series s1")
print(s1)

Series s1
1 11
2 12
3 13
4 14
dtype: int64

s2=pd.Series([21,22,23,24],index=[1,2,3,4])
print("Series s2")
print(s2)

Series s2
1 21
2 22
3 23
4 24
dtype: int64

s3=pd.Series([21,22,23,24],index=[101,102,103,104])
print("Series s3=")
print(s3)

Series s3=
101 21
102 22
103 23
104 24
dtype: int64

print(s1+s2)

1 32
2 34
3 36
4 38
dtype: int64

print(s1*s2)

1 231
2 264
3 299
4 336
dtype: int64
print(s1/s2)

1 0.523810
2 0.545455
3 0.565217
4 0.583333
dtype: float64

print(s1+s3)

1 NaN
2 NaN
3 NaN
4 NaN
101 NaN
102 NaN
103 NaN
104 NaN
dtype: float64
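
The result above is all NaN because s1 and s3 share no index labels, so no pair of values lines up. Two possible ways around this, sketched under the assumption that either a fill value or purely positional addition is acceptable for the data at hand:

```python
import pandas as pd

s1 = pd.Series([11, 12, 13, 14], index=[1, 2, 3, 4])
s3 = pd.Series([21, 22, 23, 24], index=[101, 102, 103, 104])

# Option 1: a label present in only one Series is treated as 0 in the other.
total = s1.add(s3, fill_value=0)
print(total)

# Option 2: ignore the labels and add position by position,
# keeping the index of s1 for the result.
positional = pd.Series(s1.to_numpy() + s3.to_numpy(), index=s1.index)
print(positional)
```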

#Vector operation on Series

print(s1+2)
1 13
2 14
3 15
4 16
dtype: int64

print(s2*3)

1 63
2 66
3 69
4 72
dtype: int64

print(s3**2)

101 441
102 484
103 529
104 576
dtype: int64

#Retrieving values using conditions

import pandas as pd
s=pd.Series([1.0000,1.414214,1.73205,2.000000])
print(s)
0 1.000000
1 1.414214
2 1.732050
3 2.000000
dtype: float64

print (s[s<2])

0 1.000000
1 1.414214
2 1.732050
dtype: float64

print (s[s>=2])

3 2.0
dtype: float64
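
Conditions can also be combined. The sketch below uses the same Series and the element-wise operators & (and), | (or) and ~ (not); the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1.0000, 1.414214, 1.73205, 2.000000])

# Each comparison yields a boolean Series.
# The parentheses are required because & and | bind tighter than comparisons.
middle = s[(s > 1) & (s < 2)]
ends = s[(s <= 1) | (s >= 2)]
not_small = s[~(s < 1.5)]

print(middle)
print(ends)
print(not_small)
```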

# Deleting elements from series


#drop() method: pass the index label of the element to be deleted as the argument

print(s)

0 1.000000
1 1.414214
2 1.732050
3 2.000000
dtype: float64

print(s.drop(3))

0 1.000000
1 1.414214
2 1.732050
dtype: float64
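
Note that drop() returns a new Series and leaves the original untouched. A short sketch of that behaviour, plus dropping several labels at once and tolerating a missing label (the label 99 is an arbitrary missing label chosen for illustration):

```python
import pandas as pd

s = pd.Series([1.0000, 1.414214, 1.73205, 2.000000])

trimmed = s.drop(3)            # new Series without label 3
print(len(s), len(trimmed))    # the original still has all four elements

# A list of labels removes several elements at once.
without_two = s.drop([0, 2])
print(without_two)

# Dropping a missing label raises KeyError unless errors='ignore' is passed.
unchanged = s.drop(99, errors='ignore')
print(unchanged)
```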


A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion
in rows and columns.

Features of DataFrame

• Columns can be of different types
• Size is mutable
• Labeled axes (rows and columns)
• Arithmetic operations can be performed on rows and columns

Structure

Let us assume that we are creating a DataFrame with students' data.
You can think of it as an SQL table or a spreadsheet data representation.

pandas.DataFrame
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame(data, index, columns, dtype, copy)
The parameters of the constructor are as follows −

Parameter  Description
data       Data takes various forms such as ndarray, Series, map, lists, dict, constants, and also another DataFrame.
index      Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
columns    Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
dtype      Data type to force on all columns. Only a single dtype is allowed; if None, it is inferred.
copy       Controls whether the input data is copied. Default False.

Create DataFrame
A pandas DataFrame can be created using various inputs like −

• Lists
• dict
• Series
• NumPy ndarrays
• Another DataFrame
In the subsequent sections of this chapter, we will see how to create a DataFrame using
these inputs.

Create an Empty DataFrame


The most basic DataFrame that can be created is an empty DataFrame.

Example

# import the pandas library, aliased as pd
import pandas as pd
df = pd.DataFrame()
print(df)

Its output is as follows −
Empty DataFrame
Columns: []
Index: []

Create a DataFrame from Lists


The DataFrame can be created using a single list or a list of lists.

Example 1

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

Its output is as follows −
0
0 1
1 2
2 3
3 4
4 5

Example 2

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Its output is as follows −
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13

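Example 3

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
# convert only the Age column; Name holds strings and cannot become float
df = df.astype({'Age': float})
print(df)

Its output is as follows −
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
Note − Observe that astype changes the type of the Age column to floating point. Passing
dtype=float to the constructor would try to convert every column, including Name, which
recent pandas versions reject with an error.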

Examples

Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Attributes

Attribute  Description
T          Transpose index and columns.
at         Access a single value for a row/column label pair.
attrs      Dictionary of global attributes on this object.
axes       Return a list representing the axes of the DataFrame.
columns    The column labels of the DataFrame.
dtypes     Return the dtypes in the DataFrame.
empty      Indicator whether DataFrame is empty.
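
The descriptions above correspond to the DataFrame attributes T, at, axes, columns, dtypes and empty. A minimal sketch that exercises them on the small dictionary-based frame from the earlier example:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

transposed = df.T              # rows and columns swapped
value = df.at[0, 'col2']       # single value at row label 0, column 'col2'

print(transposed)
print(value)
print(df.axes)                 # [row index, column index]
print(df.columns)
print(df.dtypes)
print(df.empty)                # the frame holds data
print(pd.DataFrame().empty)    # a frame with no data
```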

Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays must be of the same length. If an index is passed, then the length of the
index should equal the length of the arrays.
If no index is passed, then by default, the index will be range(n), where n is the array length.

Example 1

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Its output is as follows −
Name Age
0 Tom 28
1 Jack 34
2 Steve 29
3 Ricky 42
Note − Observe the values 0,1,2,3. They are the default index assigned to the rows using
range(n). The columns appear in the order of the dictionary keys.

Example 2

Let us now create an indexed DataFrame using arrays.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Its output is as follows −
Name Age
rank1 Tom 28
rank2 Jack 34
rank3 Steve 29
rank4 Ricky 42
Note − Observe that the index parameter assigns a label to each row.

Create a DataFrame from List of Dicts


List of Dictionaries can be passed as input data to create a DataFrame. The dictionary
keys are by default taken as column names.

Example 1

The following example shows how to create a DataFrame by passing a list of dictionaries.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Its output is as follows −
a b c
0 1 2 NaN
1 5 10 20.0
Note − Observe, NaN (Not a Number) is appended in missing areas.

Example 2

The following example shows how to create a DataFrame by passing a list of dictionaries
and the row indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

Its output is as follows −
a b c
first 1 2 NaN
second 5 10 20.0

Example 3

The following example shows how to create a DataFrame with a list of dictionaries, row
indices, and column indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# With two column labels that match dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# With two column labels, one of which ('b1') is not a dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

Its output is as follows −
#df1 output
a b
first 1 2
second 5 10

#df2 output
a b1
first 1 NaN
second 5 NaN
Note − Observe that df2 is created with a column label ('b1') that is not a dictionary
key, so that column is filled with NaN. df1 is created with column labels that match
dictionary keys, so no NaN appears.

Create a DataFrame from Dict of Series


Dictionary of Series can be passed to form a DataFrame. The resultant index is the union
of all the series indexes passed.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

Its output is as follows −
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Note − Observe that the series one has no label 'd', so in the result the value of column
one at label d is NaN.
Let us now understand column selection, addition, and deletion through examples.

Column Selection
We will understand this by selecting a column from the DataFrame.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])

Its output is as follows −
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64

Column Addition
We will understand this by adding a new column to an existing data frame.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Add a new column to an existing DataFrame by assigning a Series to a new column label
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([10,20,30],index=['a','b','c'])
print(df)

print("Adding a new column using the existing columns in DataFrame:")
df['four'] = df['one']+df['three']
print(df)

Its output is as follows −
Adding a new column by passing as Series:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN

Adding a new column using the existing columns in DataFrame:


one two three four
a 1.0 1 10.0 11.0
b 2.0 2 20.0 22.0
c 3.0 3 30.0 33.0
d NaN 4 NaN NaN

Column Deletion
Columns can be deleted or popped; let us take an example to understand how.

Example

# Using the previous DataFrame, we will delete a column
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)

# using the del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)

# using the pop method
print("Deleting another column using POP function:")
df.pop('two')
print(df)

Its output is as follows −
Our dataframe is:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN

Deleting the first column using DEL function:
two three
a 1 10.0
b 2 20.0
c 3 30.0
d 4 NaN

Deleting another column using POP function:
three
a 10.0
b 20.0
c 30.0
d NaN
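
Both del and pop change the DataFrame in place. When the original should be kept, one alternative is drop(columns=...), which returns a new DataFrame. The sketch below uses the same data; the names smaller and popped are illustrative.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)

# drop returns a new DataFrame; df keeps all three columns.
smaller = df.drop(columns=['one', 'two'])
print(list(df.columns))
print(list(smaller.columns))

# pop removes the column in place and returns it as a Series.
popped = df.pop('two')
print(popped)
print(list(df.columns))
```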

Row Selection, Addition, and Deletion


We will now understand row selection, addition and deletion through examples. Let us
begin with the concept of selection.

Selection by Label

Rows can be selected by passing a row label to the loc indexer.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])

Its output is as follows −
one 2.0
two 2.0
Name: b, dtype: float64
The result is a series with labels as column names of the DataFrame. And, the Name of
the series is the label with which it is retrieved.

Selection by integer location

Rows can be selected by passing an integer location to the iloc indexer.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])

Its output is as follows −
one 3.0
two 3.0
Name: c, dtype: float64

Slice Rows

Multiple rows can be selected using the : (slice) operator.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])

Its output is as follows −
one two
c 3.0 3
d NaN 4
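
One detail worth checking when slicing: positional slices exclude the stop position, while label slices with loc include both endpoints. A minimal sketch with an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# Positional slice: the stop position (3) is excluded
positional = df.iloc[1:3]

# Label slice: both endpoints ('b' and 'd') are included
labelled = df.loc['b':'d']

print(list(positional.index))
print(list(labelled.index))
```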

Addition of Rows

Add new rows to a DataFrame by concatenating it with another DataFrame using pd.concat.
The new rows are placed at the end. (The older DataFrame.append method was deprecated in
pandas 1.4 and removed in pandas 2.0.)

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])
print(df)

Its output is as follows −
a b
0 1 2
1 3 4
0 5 6
1 7 8
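
In the output above, the row labels 0 and 1 each appear twice because both frames keep their original labels. If fresh labels are preferred, one option is the ignore_index parameter of pd.concat, sketched here (the name combined is illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old labels and renumbers the rows
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```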

Deletion of Rows

Use an index label to delete or drop rows from a DataFrame. If the label is duplicated,
then multiple rows will be dropped.
If you observe, in the above example, the labels are duplicated. Let us drop a label and
see how many rows get dropped.

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])

# Drop rows with label 0
df = df.drop(0)

print(df)

Its output is as follows −
a b
1 3 4
1 7 8
In the above example, two rows were dropped because both carry the same label, 0.
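
When labels are duplicated and only one of the rows should go, dropping by label removes too much. The sketch below shows two possible workarounds under illustrative names: renumbering the rows first, or filtering by position with a boolean mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  columns=['a', 'b'], index=[0, 1, 0, 1])

# Option 1: make the labels unique first, then drop by label
renumbered = df.reset_index(drop=True)
without_first = renumbered.drop(0)

# Option 2: keep every row except position 0, leaving the labels untouched
keep = np.arange(len(df)) != 0
without_first_pos = df[keep]

print(without_first)
print(without_first_pos)
```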
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns. A DataFrame can hold any number of columns, and we can perform many
operations on them, such as arithmetic, row/column selection, and row/column addition.

A pandas DataFrame can be created in multiple ways. Let’s discuss the different ways to create
a DataFrame one by one.

Creating an empty dataframe :


The most basic DataFrame is an empty one, created by calling the DataFrame constructor
with no arguments.

# import pandas as pd
import pandas as pd

# Calling DataFrame constructor and keeping the result in df
df = pd.DataFrame()

print(df)

Output :
Empty DataFrame
Columns: []
Index: []
 
Creating a dataframe using List:
A DataFrame can be created from a single list or a list of lists.

# import pandas as pd
import pandas as pd

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']

# Calling DataFrame constructor on list
df = pd.DataFrame(lst)

print(df)

Output:
        0
0   Geeks
1     For
2   Geeks
3      is
4  portal
5     for
6   Geeks
 
Creating DataFrame from dict of ndarray/lists:
To create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length.
If an index is passed, its length must equal the length of the arrays. If no index is
passed, then by default the index will be range(n), where n is the array length.
# Python code demonstrating how to create a
# DataFrame from a dict of ndarrays / lists,
# using the default index.
import pandas as pd

# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)

Output:
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
 
Create pandas dataframe from lists using dictionary:
A pandas DataFrame can be created from a dictionary of lists by passing the dictionary to
pandas.DataFrame. Each key becomes a column name and each list supplies that column's values.

# importing pandas as pd
import pandas as pd

# dictionary of lists (named data to avoid shadowing the built-in dict)
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}

df = pd.DataFrame(data)

print(df)

Output:
     name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98
 
Multiple ways of creating dataframe :
 Different ways to create Pandas Dataframe
 Create pandas dataframe from lists using zip
 Create a Pandas DataFrame from List of Dicts
 Create a Pandas Dataframe from a dict of equal length lists
 Creating a dataframe using List
 Create pandas dataframe from lists using dictionary
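
Two of the approaches named in the list above, zip and a list of dicts, can be sketched as follows. This is a minimal illustration with made-up variable names, reusing the Name/Age values from the earlier example.

```python
import pandas as pd

names = ['Tom', 'nick', 'krish', 'jack']
ages = [20, 21, 19, 18]

# From lists paired up with zip: each (name, age) tuple becomes one row
df_zip = pd.DataFrame(list(zip(names, ages)), columns=['Name', 'Age'])

# From a list of dicts: each dict becomes one row, keys become columns
records = [{'Name': 'Tom', 'Age': 20}, {'Name': 'nick', 'Age': 21}]
df_records = pd.DataFrame(records)

print(df_zip)
print(df_records)
```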

Python Pandas - Series


Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index.

pandas.Series()
A pandas Series can be created using the following constructor −
pandas.Series(data, index, dtype, copy)
The parameters of the constructor are as follows −

Sr.No  Parameter  Description

1      data       data takes various forms like ndarray, list, dict, or a constant.

2      index      Index values must be hashable and the same length as data; duplicate
                  values are allowed. Defaults to np.arange(n) if no index is passed.

3      dtype      dtype is the data type. If None, the data type will be inferred.

4      copy       Copy data. Default False
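
As a rough illustration of these parameters (the values and names are made up), the sketch below passes data, index and dtype together, and then uses copy=True so that later changes to the source array do not reach the Series.

```python
import numpy as np
import pandas as pd

# data, index and dtype passed together
s = pd.Series(data=[1, 2, 3], index=['x', 'y', 'z'], dtype='float64')
print(s)

# copy=True gives the Series its own copy of the array's data
arr = np.array([1, 2, 3])
copied = pd.Series(arr, copy=True)
arr[0] = 99          # modify the source array afterwards
print(copied.iloc[0])
```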

A series can be created using various inputs like −

 Array
 Dict
 Scalar value or constant
Create an Empty Series
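The most basic Series is an empty one.

Example

# import the pandas library, aliased as pd
import pandas as pd
# dtype is given explicitly because the default dtype of an
# empty Series differs between pandas versions
s = pd.Series(dtype='float64')
print(s)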
Its output is as follows −
Series([], dtype: float64)

Create a Series from ndarray


If data is an ndarray, then the index passed must be of the same length. If no index is passed,
then by default the index will be range(n) where n is the array length, i.e.,
[0, 1, 2, ..., len(array) - 1].

Example 1
# import the pandas library, aliased as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

Its output is as follows −
0 a
1 b
2 c
3 d
dtype: object
We did not pass any index, so by default, it assigned the indexes ranging from 0
to len(data)-1, i.e., 0 to 3.

Example 2

# import the pandas library, aliased as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Its output is as follows −
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized indexed values in the
output.

Create a Series from dict


A dict can be passed as input. If no index is specified, the dictionary keys are used to
construct the index, in insertion order (recent pandas versions no longer sort them). If an
index is passed, the values in data corresponding to the labels in the index will be pulled out.

Example 1
# import the pandas library, aliased as pd
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Its output is as follows −
a 0.0
b 1.0
c 2.0
dtype: float64
Observe − Dictionary keys are used to construct index.

Example 2

# import the pandas library, aliased as pd
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)
Its output is as follows −
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

Observe − Index order is persisted and the missing element is filled with NaN (Not a
Number).

Create a Series from Scalar


If data is a scalar value, an index must be provided. The value will be repeated to match
the length of the index.

# import the pandas library, aliased as pd
import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

Its output is as follows −
0 5
1 5
2 5
3 5
dtype: int64

Accessing Data from Series with Position


Data in the series can be accessed similar to that in an ndarray.

Example 1

Retrieve the first element. As we already know, counting starts from zero for the array,
which means the first element is stored at position zero, and so on.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first element by position
# (s[0] relies on a positional fallback that recent pandas
# versions deprecate for non-integer indexes)
print(s.iloc[0])

Its output is as follows −
1

Example 2

Retrieve the first three elements in the Series. If a : is placed before a position, all
items before that position are extracted. If two positions are given with a : between them,
the items between the two positions are extracted (not including the stop position).

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first three elements
print(s[:3])
Its output is as follows −
a 1
b 2
c 3
dtype: int64
Example 3

Retrieve the last three elements.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the last three elements
print(s[-3:])
Its output is as follows −
c 3
d 4
e 5
dtype: int64

Retrieve Data Using Label (Index)


A Series is like a fixed-size dict in that you can get and set values by index label.

Example 1

Retrieve a single element using an index label value.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve a single element
print(s['a'])
Its output is as follows −
1
Example 2

Retrieve multiple elements using a list of index label values.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve multiple elements
print(s[['a','c','d']])
Its output is as follows −
a 1
c 3
d 4
dtype: int64

Example 3

If a label is not contained in the index, a KeyError is raised.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# try to retrieve a label that does not exist
print(s['f'])
Its output is as follows −

KeyError: 'f'
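
If a missing label should not stop the program, two options are a membership test against the index and the Series.get method, which returns a default instead of raising. A small sketch, where the default value -1 is an arbitrary choice:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# The in operator checks the index labels, not the values
print('f' in s)

# get returns the supplied default instead of raising KeyError
missing = s.get('f', -1)
present = s.get('a', -1)
print(missing, present)
```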
Creating DataFrames from scratch
Creating DataFrames right in Python is good to know and quite useful when
testing new methods and functions you find in the pandas docs.

There are many ways to create a DataFrame from scratch, but a great option is
to just use a simple dict.
Let's say we have a fruit stand that sells apples and oranges. We want to have a
column for each fruit and a row for each customer purchase. To organize this as
a dictionary for pandas we could do something like:

data = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}
And then pass it to the pandas DataFrame constructor:

purchases = pd.DataFrame(data)

purchases
OUT:
   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2

How did that work?


Each (key, value) item in data corresponds to a column in the resulting
DataFrame.
The Index of this DataFrame was given to us on creation as the numbers 0-3,
but we could also create our own when we initialize the DataFrame.
Let's have customer names as our index:

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

OUT:
        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2

So now we could locate a customer's order by using their name:


purchases.loc['June']
OUT:
apples 3
oranges 0
Name: June, dtype: int64
There's more on locating and extracting data from the DataFrame later, but now
you should be able to create a DataFrame with any random data to learn on.
Let's move on to some quick methods for creating DataFrames from various
other sources.

Import the following libraries to start:

import pandas as pd
import numpy as np
Pandas version:

import pandas as pd
print(pd.__version__)
Key and Imports

pandas DataFrame object

pandas Series object

Create Dataframe:

import pandas as pd

df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86], 'Z':[86,97,96,72,83]})

print(df)

Sample Output:

X Y Z
0 78 84 86
1 85 94 97
2 96 89 96
3 80 83 72
4 86 86 83

Create DataSeries:
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])

print(s)


Sample Output:

0     2
1     4
2     6
3     8
4    10
dtype: int64

Syntax: Creating DataFrames

Specify values for each column:


import pandas as pd
df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=[1, 2, 3])

Specify values for each row:

df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])

Create DataFrame with a MultiIndex:

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)],
        names=['n', 'v']))
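
Once a frame has a MultiIndex, rows can be selected by one level or by the full key. The sketch below rebuilds the frame above and shows three lookups; treat it as an illustration of loc and xs rather than an exhaustive guide.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)], names=['n', 'v']))

# All rows whose first level 'n' equals 'd'
first_level = df.loc['d']

# The single row identified by the full key ('e', 2)
one_row = df.loc[('e', 2)]

# All rows whose second level 'v' equals 2
second_level = df.xs(2, level='v')

print(first_level)
print(one_row)
print(second_level)
```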
Reshaping Data: Change the layout of a data set:
# Gather columns into rows.
pd.melt(df)

# Append columns of DataFrames.
# pd.concat([df1, df2], axis=1)

# Order rows by values of a column (high to low).
# df.sort_values('mpg', ascending=False)

# Rename the columns of a DataFrame.
df.rename(columns={'y': 'year'})

# Sort the index of a DataFrame.
df.sort_index()

# Reset index of DataFrame to row numbers, moving index to columns.
df.reset_index()
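
To see a few of these reshaping calls on concrete data, here is a small sketch with an illustrative two-column frame. melt produces the columns variable and value by default, which are then sorted and renumbered.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9]}, index=[1, 2, 3])

# melt gathers the columns into (variable, value) rows
long_form = pd.melt(df)
print(long_form)

# sort by value, high to low, then renumber the rows from 0
ordered = long_form.sort_values('value', ascending=False).reset_index(drop=True)
print(ordered)
```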
Subset Observations (Rows)
# Extract rows that meet logical criteria.
df[df.Length > 7]

# Remove duplicate rows (only considers columns).
df.drop_duplicates()

# Select first n rows.
df.head(n)

# Select last n rows.
df.tail(n)

# Randomly select fraction of rows.
df.sample(frac=0.5)

# Randomly select n rows.
df.sample(n=10)

# Select rows by position.
df.iloc[10:20]

# Select and order top n entries.
df.nlargest(n, 'value')

# Select and order bottom n entries.
df.nsmallest(n, 'value')
Subset Variables (Columns)
# Select multiple columns with specific names.
df[['width', 'length', 'species']]

# Select single column with specific name.
df['width']  # or df.width

# Select columns whose name matches regular expression regex.
df.filter(regex='regex')

regex (Regular Expressions) Examples

'\.'              - Matches strings containing a period '.'
'Length$'         - Matches strings ending with the word 'Length'
'^Sepal'          - Matches strings beginning with the word 'Sepal'
'^x[1-5]$'        - Matches strings beginning with 'x' and ending with 1, 2, 3, 4 or 5
'^(?!Species$).*' - Matches strings except the string 'Species'

# Select all columns between x2 and x4 (inclusive).
df.loc[:, 'x2':'x4']

# Select columns in positions 1, 2 and 5 (first column is 0).
df.iloc[:, [1, 2, 5]]

# Select rows meeting logical condition, and only the specific columns.
df.loc[df['a'] > 10, ['a', 'c']]
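
The regex patterns listed above can be tried out with DataFrame.filter. The frame below is a made-up, Iris-style example; its column names are assumptions chosen so that each pattern has something to match.

```python
import pandas as pd

df = pd.DataFrame({'Sepal.Length': [5.1, 4.9],
                   'Sepal.Width': [3.5, 3.0],
                   'Petal.Length': [1.4, 1.4],
                   'Species': ['setosa', 'setosa']})

# Columns whose name ends with 'Length'
ends_with_length = df.filter(regex='Length$')

# Columns whose name begins with 'Sepal'
starts_with_sepal = df.filter(regex='^Sepal')

# Every column except the one named exactly 'Species'
not_species = df.filter(regex='^(?!Species$).*')

print(list(ends_with_length.columns))
print(list(starts_with_sepal.columns))
print(list(not_species.columns))
```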
Handling Missing Data
# Drop rows with any column having NA/null data.
df.dropna()

# Replace all NA/null data with value.
df.fillna(value)
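
A short sketch of both calls on a frame with missing values; the data and the fill value 0 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, np.nan]})

# Only the first row has no missing values, so only it survives dropna
dropped = df.dropna()

# fillna replaces every missing value with the given value
filled = df.fillna(0)

print(dropped)
print(filled)
```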
Make New Columns
# Compute and append one or more new columns.
df.assign(Area=lambda df: df.Length*df.Height)

# Add single column.
df['Volume'] = df.Length*df.Height*df.Depth

# Bin column into n buckets.
pd.qcut(df.col, n, labels=False)
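
The following sketch runs assign and qcut on a small illustrative frame. The column names Length and Height follow the snippets above, and Bucket is a hypothetical name for the binned result.

```python
import pandas as pd

df = pd.DataFrame({'Length': [1, 2, 3, 4],
                   'Height': [10, 20, 30, 40]})

# assign returns a new DataFrame with the extra column
with_area = df.assign(Area=lambda d: d.Length * d.Height)

# qcut splits Area into 2 equal-sized buckets; labels=False yields bucket numbers
with_area['Bucket'] = pd.qcut(with_area.Area, 2, labels=False)

print(with_area)
```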
pandas provides a large set of vector functions that operate on all
columns of a DataFrame or a single selected column (a pandas
Series). These functions produce vectors of values for each of the
columns, or a single Series for the individual Series. Examples:

# Element-wise max.
df.max(axis=1)

# Trim values at input thresholds.
df.clip(lower=-10, upper=10)

# Element-wise min.
df.min(axis=1)

# Absolute value.
df.abs()
Combine Data Sets
Standard Joins
# Join matching rows from bdf to adf.
pd.merge(adf, bdf, how='left', on='x1')

# Join matching rows from adf to bdf.
pd.merge(adf, bdf, how='right', on='x1')

# Join data. Retain only rows in both sets.
pd.merge(adf, bdf, how='inner', on='x1')

# Join data. Retain all values, all rows.
pd.merge(adf, bdf, how='outer', on='x1')

Filtering Joins

# All rows in adf that have a match in bdf.
adf[adf.x1.isin(bdf.x1)]

Set-like Operations

# Rows that appear in both ydf and zdf (Intersection).
pd.merge(ydf, zdf)

# Rows that appear in either or both ydf and zdf (Union).
pd.merge(ydf, zdf, how='outer')

# Rows that appear in ydf but not zdf (Setdiff).
# The chain is wrapped in parentheses so it can span several lines.
(pd.merge(ydf, zdf, how='outer', indicator=True)
   .query('_merge == "left_only"')
   .drop(columns=['_merge']))
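
The join types are easier to compare on two tiny frames. The data in adf and bdf below is invented for illustration. The last line is an addition to the snippets above: negating the isin mask keeps the rows of adf that have no match in bdf.

```python
import pandas as pd

adf = pd.DataFrame({'x1': ['A', 'B', 'C'], 'x2': [1, 2, 3]})
bdf = pd.DataFrame({'x1': ['A', 'B', 'D'], 'x3': [True, False, True]})

left = pd.merge(adf, bdf, how='left', on='x1')    # keeps A, B, C
inner = pd.merge(adf, bdf, how='inner', on='x1')  # keeps A, B
outer = pd.merge(adf, bdf, how='outer', on='x1')  # keeps A, B, C, D

# Filtering join: rows of adf with a match in bdf
matched = adf[adf.x1.isin(bdf.x1)]

# Negated mask: rows of adf without a match in bdf
unmatched = adf[~adf.x1.isin(bdf.x1)]

print(left)
print(inner)
print(outer)
```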
Group Data
# Return a GroupBy object, grouped by values in column named "col".
df.groupby(by="col")

# Return a GroupBy object, grouped by values in index level named "ind".
df.groupby(level="ind")

All of the summary functions listed above can be applied to a group. Additional GroupBy functions:

# Size of each group.
df.groupby(by="col").size()

# Aggregate group using function.
df.groupby(by="col").agg(function)

The examples below can also be applied to groups.

# Copy with values shifted by 1.
df.shift(1)

# Ranks with no gaps.
df.rank(method='dense')

# Ranks. Ties get min rank.
df.rank(method='min')

# Ranks rescaled to interval [0, 1].
df.rank(pct=True)

# Ranks. Ties go to first value.
df.rank(method='first')

# Copy with values lagged by 1.
df.shift(-1)

# Cumulative sum.
df.cumsum()

# Cumulative max.
df.cummax()

# Cumulative min.
df.cummin()

# Cumulative product.
df.cumprod()
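
A small sketch of how these calls behave per group, using an invented frame with a grouping column col and a numeric column value:

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'a', 'b', 'b', 'b'],
                   'value': [1, 2, 3, 4, 5]})

grouped = df.groupby(by='col')

sizes = grouped.size()                  # number of rows in each group
totals = grouped['value'].agg('sum')    # one aggregated value per group
running = grouped['value'].cumsum()     # cumulative sum restarts in each group
previous = grouped['value'].shift(1)    # shift stays inside each group

print(sizes)
print(totals)
print(running.tolist())
print(previous.tolist())
```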
Windows
# Return an Expanding object allowing summary functions to be applied cumulatively.
df.expanding()

# Return a Rolling object allowing summary functions to be applied to windows of length n.
df.rolling(n)
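
Neither object computes anything until a summary function is called on it. A minimal sketch with an illustrative single-column frame and a window length of 3:

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 5]})

# Mean over a sliding window of 3 rows; the first two rows lack a full window
rolling_mean = df.rolling(3).mean()

# Sum over all rows seen so far
expanding_sum = df.expanding().sum()

print(rolling_mean)
print(expanding_sum)
```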
Plotting
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86], 'Z':[86,97,96,72,83]})
print(df)
df.plot.hist()  # Histogram for each column
plt.show()

    X   Y   Z
0  78  84  86
1  85  94  97
2  96  89  96
3  80  83  72
4  86  86  83

# Scatter chart using pairs of points
import matplotlib.pyplot as plt
from numpy.random import randn
X = randn(200)
Y = randn(200)
plt.scatter(X, Y, color='r')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
