Professional Documents
Culture Documents
• Series
• Data Frame
• Panel
These data structures are built on top of Numpy array, which means they are fast.
Dimension & Description
The best way to think of these data structures is that the higher dimensional data
structure is a container of its lower dimensional data structure. For example,
DataFrame is a container of Series, Panel is a container of DataFrame.
Building and handling two or more dimensional arrays is a tedious task, burden is
placed on the user to consider the orientation of the data set when writing functions.
But using Pandas data structures, the mental effort of the user is reduced.
For example, with tabular data (DataFrame) it is more semantically helpful to think of
the index (the rows) and the columns rather than axis 0 and axis 1.
Mutability
All Pandas data structures are value mutable (can be changed) and except Series all
are size mutable. Series is size immutable.
Note − DataFrame is widely used and one of the most important data structures. Panel
is used much less.
Series
Series is a one-dimensional array like structure with homogeneous data. For example,
the following series is a collection of integers 10, 23, 56, …
10 23 56 17 52 61 73 90 26 72
Key Points
• Homogeneous data
• Size Immutable
• Values of Data Mutable
DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For example,
Name Age Gender Rating
The table represents the data of a sales team of an organization with their overall
performance rating. The data is represented in rows and columns. Each column
represents an attribute and each row represents a person.
Column Type
Name String
Age Integer
Gender String
Rating Float
Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable
Panel
Panel is a three-dimensional data structure with heterogeneous data. It is hard to
represent the panel in graphical representation. But a panel can be illustrated as a
container of DataFrame.
Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable
Python Pandas – Series
Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index.
pandas.Series
A pandas Series can be created using the following constructor −
pandas.Series( data, index, dtype, copy)
The parameters of the constructor are as follows −
1
data
data takes various forms like ndarray, list, constants
2
index
Index values must be unique and hashable, same length as data.
Default np.arrange(n) if no index is passed.
3
dtype
dtype is for data type. If None, data type will be inferred
4
copy
Copy data. Default False
• Array
• Dict
• Scalar value or constant
Create an Empty Series
A basic series, which can be created is an Empty Series.
Example
#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series()
print s
Its output is as follows −
Series([], dtype: float64)
Create a Series from ndarray
If data is an ndarray, then index passed must be of the same length. If no index is
passed, then by default index will be range(n) where n is array length, i.e.,
[0,1,2,3…. range(len(array))-1].
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print s
Its output is as follows −
0a
1b
2c
3d
dtype: object
We did not pass any index, so by default, it assigned the indexes ranging from 0
to len(data)-1, i.e., 0 to 3.
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print s
Its output is as follows −
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized indexed values in
the output.
import pandas as pd
print(series3)
Output :
Subtraction of 2 Series
# importing the module
import pandas as pd
print(series3)
Output :
Multiplication of 2 Series
# importing the module
import pandas as pd
print(series3)
Output :
Division of 2 Series
# importing the module
import pandas as pd
print(series3)
Output :
head() and tail() functions of Series Object
Pandas Series provides two very useful methods for extracting the data from
the top and bottom of the Series Object. These methods are head() and tail().
Head() Method
head() method is used to get the elements from the top of the series. By
default, it gives 5 elements.
Syntax:
<Series Object> . head(n = 5)
Example:
Consider the following Series, we will perform the operations on the below
given Series S.
1 import pandas as pd
2 s = pd.Series({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7,'H':8})
3 print(s)
4
5 Output:
6A 1
7B 2
8C 3
9D 4
10 E 5
11 F 6
12 G 7
13 H 8
14 dtype: int64
15
1 s.head()
2 Output:
3A 1
4B 2
5C 3
6D 4
7E 5
8 dtype: int64
9
head() Function with Positive Argument
When a positive number is provided, the head() function will extract the top n
rows from Series Object. In the below given example, I have given 7, so 7
rows from the top has been extracted.
1 s.head(7)
2 Output:
3A 1
4B 2
5C 3
6D 4
7E 5
8F 6
9G 7
10 dtype: int64
11
1 s.head(-7)
2
3 Output:
4A 1
5 dtype: int64
Tail() Method
tail() method gives the elements of series from the bottom.
Syntax:
<Series Object> . tail(n = 5)
Example:
Consider the following Series, we will perform the operations on the below
given Series S.
1 import pandas as pd
2 s = pd.Series({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7,'H':8})
3 print(s)
4
5 Output:
6A 1
7B 2
8C 3
9D 4
10 E 5
11 F 6
12 G 7
13 H 8
14 dtype: int64
15
1 s.tail()
2 '''
3 Output:
4D 4
5E 5
6F 6
7G 7
8H 8
9 dtype: int64
10 '''
1 s.tail(-7)
2 '''
3 Output:
4H 8
5 dtype: int64
6 '''
Indexing-
Pandas series is a One-dimensional ndarray with axis labels. The labels
need not be unique but must be a hashable type. The object supports both
integer- and label-based indexing and provides a host of methods for
performing operations involving the index.
Pandas Series.index attribute is used to get or set the index labels of the
given Series object.
Syntax:Series.index
Parameter : None
Returns : index
Example #1: Use Series.index attribute to set the index label for the given
Series object.
# importing pandas as pd
import pandas as pd
Now we will use Series.index attribute to set the index label for the given
object.
Output :
Now we will use Series.index attribute to get the index label for the given
object.
# print the index labels
print(sr.index)
Output :
Slicing
Slicing is a powerful approach to retrieve subsets of data from a pandas object.
A slice object is built using a syntax of start:end:step, the segments
representing the first item, last item, and the increment between each item that
you would like as the step.
import pandas as pd
num = [000, 100, 200, 300, 400, 500, 600, 700, 800, 900]
idx = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
series = pd.Series(num, index=idx)
print("\n [2:2] \n")
print(series[2:4])
print("\n [1:6:2] \n")
print(series[1:6:2])
print("\n [:6] \n")
print(series[:6])
print("\n [4:] \n")
print(series[4:])
print("\n [:4:2] \n")
print(series[:4:2])
print("\n [4::2] \n")
print(series[4::2])
print("\n [::-1] \n")
print(series[::-1])
C:\python\pandas examples>python example1f.py
[2:2]
C 200
D 300
dtype: int64
[1:6:2]
B 100
D 300
F 500
dtype: int64
[:6]
A 0
B 100
C 200
D 300
E 400
F 500
dtype: int64
[4:]
E 400
F 500
G 600
H 700
I 800
J 900
dtype: int64
[:4:2]
A 0
C 200
dtype: int64
[4::2]
E 400
G 600
I 800
dtype: int64
[::-1]
J 900
I 800
H 700
G 600
F 500
E 400
D 300
C 200
B 100
A 0
dtype: int64
Features of DataFrame
• Potentially columns are of different types
• Size – Mutable
• Labeled axes (rows and columns)
• Can Perform Arithmetic operations on rows and columns
Structure
Let us assume that we are creating a data frame with student’s data.
You can think of it as an SQL table or a spreadsheet data representation.
pandas.DataFrame
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
The parameters of the constructor are as follows −
1
data
data takes various forms like ndarray, series, map, lists, dict, constants and
also another DataFrame.
2
index
For the row labels, the Index to be used for the resulting frame is Optional
Default np.arange(n) if no index is passed.
3
columns
For column labels, the optional default syntax is - np.arange(n). This is only
true if no index is passed.
4
dtype
Data type of each column.
5
copy
This command (or whatever it is) is used for copying of data, if the default
is False.
Create DataFrame
A pandas DataFrame can be created using various inputs like −
• Lists
• dict
• Series
• Numpy ndarrays
• Another DataFrame
In the subsequent sections of this chapter, we will see how to create a DataFrame
using these inputs.
Example
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print df
Example 1
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print df
Example 1
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print df
Example 2
Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print df
Example 1
The following example shows how to create a DataFrame by passing a list of
dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print df
Example 2
The following example shows how to create a DataFrame by passing a list of
dictionaries and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print df
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print df1
print df2
#df2 output
a b1
first 1 NaN
second 5 NaN
Note − Observe, df2 DataFrame is created with a column index other than the
dictionary key; thus, appended the NaN’s in place. Whereas, df1 is created with
column indices same as dictionary keys, so NaN’s appended.
Example
import pandas as pd
df = pd.DataFrame(d)
print df
Column Selection
We will understand this by selecting a column from the DataFrame.
Example
import pandas as pd
df = pd.DataFrame(d)
print df ['one']
Column Addition
We will understand this by adding a new column to an existing data frame.
Example
import pandas as pd
df = pd.DataFrame(d)
# Adding a new column to an existing DataFrame object with column label by passing new
series
print df
Column Deletion
Columns can be deleted or popped; let us take an example to understand how.
Example
# Using the previous DataFrame, we will delete a column
# using del function
import pandas as pd
df = pd.DataFrame(d)
print ("Our dataframe is:")
print df
Selection by Label
Rows can be selected by passing row label to a loc function.
import pandas as pd
df = pd.DataFrame(d)
print df.loc['b']
import pandas as pd
df = pd.DataFrame(d)
print df.iloc[2]
import pandas as pd
df = pd.DataFrame(d)
print df[2:4]
import pandas as pd
df = df.append(df2)
print df
import pandas as pd
df = df.append(df2)
print df
Output:
Example 2: Rename multiple columns.
# Import pandas package
import pandas as pd
# Define a dictionary containing ICC rankings
rankings = {'test': ['India', 'South Africa', 'England',
'New Zealand', 'Australia'],
'odi': ['England', 'India', 'New Zealand',
'South Africa', 'Pakistan'],
't20': ['Pakistan', 'India', 'Australia',
'England', 'New Zealand']}
# Convert the dictionary into DataFrame
rankings_pd = pd.DataFrame(rankings)
# Before renaming the columns
print(rankings_pd.columns)
rankings_pd.rename(columns = {'test':'TEST', 'odi':'ODI',
't20':'T20'}, inplace = True)
# After renaming the columns
print(rankings_pd.columns)
Output:
The columns can also be renamed by directly assigning a list containing the new names
to the columns attribute of the Dataframe object for which we want to rename the
columns. The disadvantage of this method is that we need to provide new names for all
the columns even if want to rename only some of the columns.
Output:
In this example, we will rename the column name using the set_axis function, we will
pass the new column name and axis that should be replaced with a new name in the
column as a parameter.
Output:
rankings_pd = rankings_pd.add_prefix('col_')
rankings_pd = rankings_pd.add_suffix('_1')
# After renaming the columns
rankings_pd.head()
Output:
col_test_1 col_odi_1 col_t20_1
0 India England Pakistan
1 South Africa India India
2 England New Zealand Australia
3 New Zealand South Africa England
4 Australia Pakistan New Zealand
import pandas as pd
rankings_pd = pd.DataFrame(rankings)
print(rankings_pd.columns)
# df = rankings_pd
rankings_pd.head()
Output:
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print df
print ("The first two rows of the data frame is:")
print df.head(2)
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print df
print ("The last two rows of the data frame is:")
print df.tail(2)
Output:
Output:
In the above example, we use the concept of label based Fancy Indexing to access
multiple elements of data frame at once and hence create a new column ‘Value‘
using function dataframe.lookup()
Example 2:
# importing pandas library
import pandas as pd
Output:
Output:
In the above example, we use the concept of label based Fancy Indexing to access
multiple elements of data frame at once and hence create two new columns ‘Age‘
and ‘Marks‘ using function dataframe.lookup()
Example 3:
# importing pandas library
import pandas as pd
# Creating a Data frame
df = pd.DataFrame([['Date1', 1850, 1992,'Avi', 5, 41, 70, 'Avi'],
['Date2', 1896, 1950, 'Cathy', 10, 1, 22, 'Avi'],
['Date2', 1900, 1920, 'Cathy', 24, 11, 44, 'Cathy'],
['Date1', 1889, 1960, 'Bob', 2, 11, 10, 'Bob'],
['Date2', 1910, 1952, 'Avi', 20, 10, 40, 'Bob'],
['Date1', 1999, 1929, 'Avi', 50, 8, 11, 'Cathy']],
columns=('Year', 'Date1', 'Date2', 'Name', 'Avi',
'Bob', 'Cathy', 'Alpha'))
# Display Data frame
df
Output:
# Use concept of fancy indexing to make two
Output:
In the above example, we use the concept of label based Fancy Indexing to access
multiple elements of the data frame at once and hence create two new columns
‘Age‘, ‘Height‘ and ‘Date_of_Birth‘ using function dataframe.lookup()
Boolean indexing is a type of indexing that uses actual values of the data in
the DataFrame. In boolean indexing, we can filter a data in four ways:
• Accessing a DataFrame with a boolean index
• Applying a boolean mask to a dataframe
• Masking data based on column value
• Masking data based on an index value
Accessing a DataFrame with a boolean index:
In order to access a dataframe with a boolean index, we have to create a
dataframe in which the index of dataframe contains a boolean value that is
“True” or “False”.
Example
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
df = pd.DataFrame(dict, index = [True, False, True, False])
print(df)
Output:
Now we have created a dataframe with the boolean index after that user can
access a dataframe with the help of the boolean index. User can access a
dataframe using three functions that is .loc[], .iloc[], .ix[]
Accessing a Dataframe with a boolean index using .loc[]
In order to access a dataframe with a boolean index using .loc[], we simply
pass a boolean value (True or False) in a .loc[] function.
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe with boolean index
df = pd.DataFrame(dict, index = [True, False, True, False])
# accessing a dataframe using .loc[] function
print(df.loc[True])
Output:
Output:
TypeError
Code #2:
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe with boolean index
df = pd.DataFrame(dict, index = [True, False, True, False])
# accessing a dataframe using .iloc[] function
print(df.iloc[1])
Output:
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe with boolean index
df = pd.DataFrame(dict, index = [True, False, True, False])
Output:
Code #2:
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
Output: