Professional Documents
Culture Documents
Outline: Introduction To Pandas Module
Outline: Introduction To Pandas Module
1 / 20
Outline Importing/Exporting Data Series/DataFrame
All of the data readers in pandas load data into a pandas DataFrame.
Comma-separated value (CSV) files can be read using read_csv.
Excel files, both 97/2003 (xls) and 2007/10/13 (xlsx), can be imported using
read_excel.
Stata files can be read using read_stata.
2 / 20
Outline Importing/Exporting Data Series/DataFrame
read_excel()
read_excel() supports reading data from both xls (Excel 2003) and xlsx
(Excel 2007/10/13) formats.
Notable keyword arguments include:
header, an integer indicating which row to use for the column labels. The
default is 0 (top) row.
skiprows, typically an integer indicating the number of rows at the top of the
sheet to skip before reading the file. The default is 0.
skip_footer, typically an integer indicating the number of rows at the bottom
of the sheet to skip when reading the file. The default is 0.
index_col, an integer or column name indicating the column to use as the index.
If not provided, a basic numeric index is generated.
3 / 20
Outline Importing/Exporting Data Series/DataFrame
read_csv()
4 / 20
Outline Importing/Exporting Data Series/DataFrame
Import/Export Data
5 / 20
Outline Importing/Exporting Data Series/DataFrame
Pandas
Series are the primary building block of the data structures in pandas, and in
many ways a Series behaves similar to a NumPy array.
6 / 20
Outline Importing/Exporting Data Series/DataFrame
Series
# Series
In [28]: s = Series ([0.1 , 1.2 , 2.3 , 3.4 , 4.5])
In [29]: s
Out [29]:
0 0.1
1 1.2
2 2.3
3 3.4
4 4.5
dtype : float64
In [30]: type ( s )
Out [30]: pandas . core . series . Series
In [31]: # From array
In [32]: a = np . array ([0.1 , 1.2 , 2.3 , 3.4 , 4.5])
In [33]: s = Series ( a )
In [34]: s
Out [34]:
0 0.1
1 1.2
2 2.3
3 3.4
4 4.5
dtype : float64
# From tuple
In [36]: a =(1 ,2 ,3 ,4 , ' abs ' , np . nana )
In [37]: s = Series ( a )
In [38]: s
Out [38]:
0 1
1 2
2 3
3 4
4 abs
5 NaN
dtype : object
7 / 20
Outline Importing/Exporting Data Series/DataFrame
Series
8 / 20
Outline Importing/Exporting Data Series/DataFrame
Series
9 / 20
Outline Importing/Exporting Data Series/DataFrame
Series
In mathematical operations, indices that do not match are given the value NaN
(not a number).
In [143]: s1 = Series ({ ' a ' : 0.1 , 'b ': 1.2 , 'c ': 2.3})
In [144]: s2 = Series ({ ' a ' : 1.0 , 'b ': 2.0 , 'c ': 3.0})
In [145]: s3 = Series ({ ' c ' : 0.1 , 'd ': 1.2 , 'e ': 2.3})
In [146]: s1 + s2
Out [146]:
a 1.1
b 3.2
c 5.3
dtype : float64
In [147]: s1 * s2
Out [147]:
a 0.1
b 2.4
c 6.9
dtype : float64
In [148]: s1 + s3
Out [148]:
a NaN
b NaN
c 2.4
d NaN
e NaN
dtype : float64
10 / 20
Outline Importing/Exporting Data Series/DataFrame
Series
11 / 20
Outline Importing/Exporting Data Series/DataFrame
Series
In [107]: s1 = Series ([1.0 ,2 ,3])
In [108]: s1 . values # access to values
Out [108]: array ([1. , 2. , 3.])
In [109]: s1 . index
Out [109]: RangeIndex ( start =0 , stop =3 , step =1)
In [110]: s1 . index = [ ' cat ' , ' dog ' , ' elephant ' ] # set index labels
In [111]: s1 . index
Out [111]: Index ([ ' cat ' , ' dog ' , ' elephant ' ] , dtype = ' object ' )
# Try the followings
s1 = Series ([1 ,3 ,5 ,6 , np . nan , ' cat ' , ' abc ' ,10 ,12 ,5])
s1 . index =[ ' a ' , ' b ' , ' c ' , ' d ' , ' e ' , ' f ' , ' g ' , ' h ' , ' i ' , ' k ' ]
s1 . head ()
s1 . tail ()
s1 . isnull ()
s1 . notnull ()
s1 . loc [ ' e ' ]
s1 . iloc [4]
s1 . drop ( ' e ' )
s1 . dropna ()
s1 . fillna ( -99)
s1 = Series ( np . arange (10.0 ,20.0))
s1 . describe ()
summ = s1 . describe ()
summ
summ [ ' mean ' ]
s1 = Series ([1 , 2 , 3])
s2 = Series ([4 , 5 , 6])
s1 . append ( s2 )
s1 . append ( s2 , ignore_index = True )
s1 = Series ([1 , 2 , 3])
s2 = Series ([4 , 5 , 6])
s1 . replace (1 , -99)
s1 . update ( s2 )
s1
12 / 20
Outline Importing/Exporting Data Series/DataFrame
DataFrame
DataFrames collect multiple series in the same way that a spreadsheet collects
multiple columns of data.
In [2]: df = DataFrame ( array ([[1 ,2] ,[3 ,4]]) , columns =[ ' dogs ' , ' cats ' ] , \
...: index =[ ' Alice ' , ' Bob ' ])
In [3]: df
Out [3]:
dogs cats
Alice 1 2
Bob 3 4
13 / 20
Outline Importing/Exporting Data Series/DataFrame
Accessing columns:
1 accessing via dictionary-style indexing of the column name: data[’state’]
2 attribute-style accessing with column names: data.state
14 / 20
Outline Importing/Exporting Data Series/DataFrame
Finally, rows can also be selected using logical selection using a Boolean array
with the same number of elements as the number of rows as the DataFrame.
recession = data [ ' gdp_growth_2010 ' ] <0
data [ recession ] # returns states for which gdp_growth_2010 is negative
15 / 20
Outline Importing/Exporting Data Series/DataFrame
It is not possible to use standard slicing to select both rows and columns. But
we can use loc[rowselector, coloumnselector].
data . loc [10:15 , ' state ' ]
data . loc [ recession , ' state ' ]
data . loc [ recession ,[ ' state ' , ' gdp_growth_2010 ' ]]
data . loc [ recession ,[ ' state ' , ' gdp_growth_2009 ' , ' gdp_growth_2010 ' ]]
Note that in the case of loc indexer, we index the underlying data in an
array-like style but using the explicit index and column names.
Using the iloc indexer, we can index the underlying array as if it is a simple
NumPy array (using the implicit Python-style index).
In [31]: data . iloc [0:2 ,0:5]
Out [31]:
state_code state gdp_2009 gdp_2010 gdp_2011
0 AK Alaska 44215 43472 44232
1 AL Alabama 149843 153839 155390
16 / 20
Outline Importing/Exporting Data Series/DataFrame
17 / 20
Outline Importing/Exporting Data Series/DataFrame
# Deleting a column
data_copy = data . copy ()
# Keep only ' gdp_2009 ' , ' gdp_growth_2011 ' and ' gdp_growth_2012 '
data_copy = data_copy [[ ' gdp_2009 ' , ' gdp_growth_2011 ' , ' gdp_growth_2012 ' ]]
data_copy . head ()
# Drop the gdp_2009 column
data_copy . drop ( ' gdp_2009 ' , axis =1)
data_copy . drop ( ' gdp_2009 ' , axis =1 , inplace = True )
# Delete ' gdp_growth_2012 '
data_2012 = data_copy . pop ( ' gdp_growth_2012 ' )
data_2012 . head ()
data_copy . head ()
# Delete gdp_growth_2011 column
del data_copy [ ' gdp_growth_2011 ' ]
data_copy . head ()
18 / 20
Outline Importing/Exporting Data Series/DataFrame
Some useful functions and methods are listed in the table below.
19 / 20
Outline Importing/Exporting Data Series/DataFrame
20 / 20