Iteration
• The behavior of basic iteration over Pandas objects
depends on the type.
• When iterating over a Series, it is regarded as
array-like, and basic iteration produces the values.
• Other data structures, like DataFrame and Panel
(Panel was removed in pandas 0.25), follow the
dict-like convention of iterating over the keys
of the objects.
• In short, basic iteration (for i in object) produces −
• Series − values
• DataFrame − column labels
• Panel − item labels
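The difference between the two conventions can be seen in a minimal sketch. The Series and DataFrame below are hypothetical sample data, not taken from the slides.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# Iterating a Series yields its values (array-like behaviour)
series_items = [value for value in s]

# Iterating a DataFrame yields its column labels (dict-like behaviour)
frame_items = [label for label in df]

print(series_items)
print(frame_items)
```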
Iterating a DataFrame
Iterating a DataFrame gives column names.
import pandas as pd
import numpy as np
N = 20
df = pd.DataFrame({
    'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
    'x': np.linspace(0, stop=N-1, num=N), 'y': np.random.rand(N),
    'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
    'D': np.random.normal(100, 10, size=(N)).tolist() })
# iterating the DataFrame yields its column labels
for col in df:
    print(col)
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud)
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df=df.sort_index()
print(sorted_df)
Order of Sorting
• Passing a Boolean value to the ascending
parameter controls the order of the sorting;
ascending=False sorts in descending order.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df = df.sort_index(ascending=False)
print(sorted_df)
Sort the Columns
• The axis argument takes the value 0 or 1. With
axis=1, the sorting is done on the column labels.
• By default, axis=0, which sorts by row labels.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df=df.sort_index(axis=1)
print(sorted_df)
By Value
• Like index sorting, sort_values() is the method for
sorting by values.
• It accepts a 'by' argument naming the column of
the DataFrame whose values determine the sort
order.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df = df.sort_values(by='IP')
print(sorted_df)
Sorting Algorithm
• sort_values() lets you choose the sorting
algorithm through the kind parameter: mergesort,
heapsort or quicksort. Of these three, mergesort
is the only stable algorithm.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df = df.sort_values(by='IP',kind='mergesort')
print(sorted_df)
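What "stable" means can be sketched with the same student data: rows with equal IP marks keep the relative order they had before sorting. The variable names stable and order are illustrative only.

```python
import pandas as pd

stud = {'Name': ['P', 'R', 'A', 'J', 'B'],
        'Eng': [67, 76, 75, 88, 92],
        'IP': [99, 99, 98, 97, 98],
        'Maths': [98, 99, 97, 98, 90]}
df = pd.DataFrame(stud, index=['P', 'R', 'A', 'J', 'B'])

# A stable sort keeps tied rows in their original order:
# A stays ahead of B (both 98) and P stays ahead of R (both 99).
stable = df.sort_values(by='IP', kind='mergesort')
order = list(stable.index)
print(order)
```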
head() and tail() functions
• The head() function fetches the first 'n' rows of
a pandas Series or DataFrame. By default it shows
the first 5 rows.
• Ex. df.head() or df.head(2)
• The tail() function fetches the last 'n' rows of
a pandas Series or DataFrame. By default it shows
the last 5 rows.
• Ex. df.tail() or df.tail(2)
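A small sketch of both calls follows. It uses a hypothetical ten-row DataFrame so that the default of 5 rows differs from the full data.

```python
import pandas as pd

# Hypothetical ten-row frame, used only to make the defaults visible
df = pd.DataFrame({'n': range(10)})

first_default = df.head()    # first 5 rows by default
first_two = df.head(2)       # first 2 rows
last_default = df.tail()     # last 5 rows by default
last_two = df.tail(2)        # last 2 rows

print(first_two)
print(last_two)
```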
Boolean Indexing in Pandas
• In boolean indexing, we will select subsets of
data based on the actual values of the data in
the DataFrame and not on their row/column
labels or integer locations. In boolean
indexing, we use a boolean vector to filter the
data.
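As a minimal sketch of this idea, the boolean vector below is built from the values of the Eng column in the student data, and only the rows where it is True are kept. The threshold of 75 is an arbitrary choice for illustration.

```python
import pandas as pd

stud = {'Name': ['P', 'R', 'A', 'J', 'B'],
        'Eng': [67, 76, 75, 88, 92],
        'IP': [99, 99, 98, 97, 98],
        'Maths': [98, 99, 97, 98, 90]}
df = pd.DataFrame(stud)

# Boolean vector: True where the Eng mark is above 75
mask = df['Eng'] > 75

# Keep only the rows where the vector is True
high_eng = df[mask]
print(high_eng)
```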
Boolean Indexing in DataFrame
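• Boolean indexing is a type of indexing which uses
the actual values of the data in the DataFrame. In
boolean indexing, we can filter data in four
ways:
• accessing a DataFrame with a boolean index
• applying a boolean mask to a DataFrame
• masking data based on column value
• masking data based on index value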
# importing pandas as pd
import pandas as pd
# dictionary of lists
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud, index = [True, False, True, False,True])
print(df)
To access the rows of a DataFrame that has a boolean index, pass a
boolean value (True or False) to .loc[].
# importing pandas as pd
import pandas as pd
# dictionary of lists
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud, index = [True, False, True, False,True])
# accessing a dataframe using .loc[] function
print(df.loc[True])
Applying a boolean mask to a DataFrame
To apply a boolean mask to a DataFrame, use
__getitem__, that is, the [] accessor.
The mask is a list of True and False values with
the same length as the number of rows in the
DataFrame.
Applying the mask returns only the rows whose
mask value is True.
Example
# importing pandas as pd
import pandas as pd
# dictionary of lists
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud, index = [0, 1, 2, 3, 4])
print(df[[True, False, True, False,True]])
Masking data based on column value
To filter data based on a column value, apply a
condition to the column using a comparison
operator such as ==, >, <, <= or >=.
Applying one of these operators to a column
produces a Series of True and False values.
Example
# importing pandas as pd
import pandas as pd
# dictionary of lists
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["BCA", "BCA", "M.Tech", "BCA"],
'score':[90, 40, 80, 98]}
# creating a dataframe
df = pd.DataFrame(data)
# mask built from the index value: True only for the row labelled 0
mask = df.index == 0
print(df[mask])
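A mask built from a column value, as described in this section, might look like the following sketch. It reuses the name, degree and score data; the variable names is_bca and bca_high and the score threshold of 90 are illustrative assumptions.

```python
import pandas as pd

data = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["BCA", "BCA", "M.Tech", "BCA"],
        'score': [90, 40, 80, 98]}
df = pd.DataFrame(data)

# Comparing a column with a value yields a boolean Series
is_bca = df['degree'] == 'BCA'
print(is_bca)

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses
bca_high = df[is_bca & (df['score'] >= 90)]
print(bca_high)
```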