Pandas Unexplored PDF
pandas
Methods to read data are all named pd.read_* where * is the file type.
Series and DataFrames can be saved to disk using their to_* method.
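As a minimal sketch of the read_*/to_* naming pattern, the round trip below writes a small DataFrame with to_csv and reads it back with pd.read_csv. The sample data and the in-memory buffer (used here instead of a file on disk) are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the cheat sheet.
df = pd.DataFrame({"X": [1, 2], "Y": [3.5, 4.5]}, index=["a", "b"])

# Save with a to_* method; a StringIO buffer stands in for a file path.
buffer = io.StringIO()
df.to_csv(buffer)

# Read it back with the matching read_* function.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
```

The same pairing applies to other formats (for example to_json with pd.read_json), although each format takes its own set of options.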
Usage Patterns
• Use pd.read_clipboard() for one-off data extractions.
Reading Text Files into a DataFrame
Colors highlight how different arguments map from the data file to a DataFrame.
# Historical_data.csv
Date, Cs, Rd
2005-01-03, 64.78, -
2005-01-04, 63.79, 201.4
2005-01-05, 64.46, 193.45
...
Data from Lab Z.
Recorded by Agent E

>>> read_table(
        'historical_data.csv',
        sep=',',
        header=1,
        skiprows=1,
        skipfooter=2,
        index_col=0,
        parse_dates=True,
        na_values=['-'])
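The sketch below shows one way these arguments can fit together on an in-memory copy of similar data. The exact file layout is an assumption: one comment line, then the header row, three data rows, and two footer lines. For that layout header=0 is used after skiprows=1, so adjust both to match the real file. The skipinitialspace and engine arguments are additions not shown above; pd.read_csv is used because it defaults to comma-separated input.

```python
import io

import pandas as pd

# Assumed file contents, modelled on the sample above.
raw = """# historical_data.csv
Date, Cs, Rd
2005-01-03, 64.78, -
2005-01-04, 63.79, 201.4
2005-01-05, 64.46, 193.45
Data from Lab Z.
Recorded by Agent E
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=",",
    skiprows=1,             # skip the leading comment line
    header=0,               # first remaining line holds the column names
    skipfooter=2,           # drop the two trailing note lines
    index_col=0,            # use the Date column as the index
    parse_dates=True,       # parse the index as dates
    na_values=["-"],        # treat "-" as missing
    skipinitialspace=True,  # ignore the space after each comma
    engine="python",        # skipfooter is only supported by the python engine
)
```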
>>> df_list = read_html(url)
Rule 2: Element-By-Element Mathematical Operations
When the operands do not share all labels, use the add, sub, mul, and div methods with fill_value to set a fill value.
df + 1 df.abs() np.log(df)
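A short sketch of element-by-element operations, including the fill_value argument of add. The two small DataFrames are hypothetical and chosen so that their labels only partly overlap.

```python
import numpy as np
import pandas as pd

# Hypothetical data: `other` covers only one row and one column of `df`.
df = pd.DataFrame({"X": [1.0, 2.0], "Y": [3.0, 4.0]}, index=["a", "b"])
other = pd.DataFrame({"X": [10.0]}, index=["a"])

plus_one = df + 1    # scalar applied to every element
logged = np.log(df)  # NumPy functions work element by element too

# The + operator aligns on labels; positions missing from one side become NaN.
aligned = df + other

# The add method does the same, but substitutes 0 where one side is missing.
filled = df.add(other, fill_value=0)
```

With the operator only the ('a', 'X') cell has a value in both operands, so the other three cells are NaN; with fill_value=0 every cell keeps a value.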
Rule 3: Reduction Operations
>>> df.sum()
Reducing a DataFrame returns a Series.
Operates across rows by default (axis=0, or axis='rows').
Operate across columns with axis=1 or axis='columns'.

Apply a Function to Each Value
Apply a function to each value in a Series or DataFrame:
s.apply(value_to_value)        Series
df.applymap(value_to_value)    DataFrame (deprecated since pandas 2.1 in favor of df.map)
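The sketch below illustrates both ideas on a small hypothetical DataFrame: a reduction along each axis, then a value-to-value function applied to a Series and to a DataFrame. The hasattr check is an assumption added so the example runs on pandas versions where applymap has been superseded by DataFrame.map.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"X": [1, 2, 3], "Y": [4, 5, 6]}, index=["a", "b", "c"])

# Reductions collapse one axis and return a Series.
column_totals = df.sum()            # axis=0: one value per column
row_totals = df.sum(axis="columns") # axis=1: one value per row

# Apply a value-to-value function to every element of a Series.
squared = df["X"].apply(lambda value: value ** 2)

# Same idea for a DataFrame; prefer DataFrame.map where it exists.
elementwise = df.map if hasattr(df, "map") else df.applymap
doubled = elementwise(lambda value: value * 2)
```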
With a Series, Pandas plots values against the index:
> ax = s.plot()

With a DataFrame, Pandas creates one line per column:
> ax = df.plot()

Use Matplotlib to override or add annotations:
> ax.set_xlabel('Time')
> ax.set_ylabel('Value')
> ax.set_title('Experiment A')

Pass labels if you want to override the column names and set the legend location:
> ax.legend(labels, loc='best')

When plotting the results of complex manipulations with groupby, it's often useful to stack/unstack the resulting DataFrame to fit the one-line-per-column assumption (see Data Structures cheatsheet).

Useful Arguments to plot
• subplots=True: one subplot per column, instead of one line
• figsize: set figure size, in inches
• x and y: plot one column against another

Kinds of Plots
df.plot.scatter(x, y)    df.plot.bar()    df.plot.hist()    df.plot.box()
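A minimal plotting sketch that combines the calls above. It assumes Matplotlib is installed; the Agg backend, the sample data, and the legend labels are choices made for this illustration so that it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window is opened

import pandas as pd

# Hypothetical sample data: two columns give two lines.
df = pd.DataFrame({"X": [1, 3, 2, 4], "Y": [2, 1, 4, 3]})

# One line per column, with an explicit figure size in inches.
ax = df.plot(figsize=(6, 4))

# Matplotlib calls on the returned Axes add annotations.
ax.set_xlabel("Time")
ax.set_ylabel("Value")
ax.set_title("Experiment A")

# Override the column names shown in the legend.
labels = ["first", "second"]
ax.legend(labels, loc="best")

# One subplot per column instead of one line per column.
axes = df.plot(subplots=True)

# Plot one column against another.
scatter_ax = df.plot.scatter(x="X", y="Y")
```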
Frequency Offsets
Used by date_range, period_range and resample:
• B: Business day
• D: Calendar day
• W: Weekly
• M: Month end
• MS: Month start
• BM: Business month end
• Q: Quarter end
• A: Year end
• AS: Year start
• H: Hourly
• T, min: Minutely
• S: Secondly
• L, ms: Milliseconds
• U, us: Microseconds
• N: Nanoseconds
For more: Lookup "Pandas Offset Aliases" or check out pandas.tseries.offsets, and pandas.tseries.holiday modules.

Creating Ranges or Periods
> pd.period_range(start=None, end=None, periods=None, freq=offset)

Resampling
> s_df.resample(freq_offset).mean()
resample returns a groupby-like object that must be aggregated with mean, sum, std, apply, etc. (See also the Split-Apply-Combine cheat sheet.)
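The sketch below uses a few of these aliases with date_range, period_range, and resample. The dates and values are hypothetical. The D and W aliases are used for the timestamps because some of the other aliases listed above have been renamed in recent pandas releases, so check the offset alias table for the installed version.

```python
import pandas as pd

# Ten consecutive calendar days, starting on a Monday.
days = pd.date_range(start="2024-01-01", periods=10, freq="D")

# Three monthly periods.
months = pd.period_range(start="2024-01", periods=3, freq="M")

# A daily Series with the values 0..9.
s = pd.Series(range(10), index=days)

# resample returns a groupby-like object; aggregate it to get a result.
# "W" groups into weeks that end on Sunday.
weekly_sum = s.resample("W").sum()
```

Here the first seven days (Monday to Sunday) fall in the first weekly bin and the remaining three days in the second.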
[Figure: g.filter(…) applied to a DataFrame with columns X, Y, Z]

Other Groupby-Like Operations: Window Functions
• resample, rolling, and ewm (exponentially weighted function) methods behave like GroupBy objects. They keep track of which row is in which "group". Results must be aggregated with sum, mean, count, etc. (see Aggregation).
• resample is often used before rolling, expanding, and ewm when using a DateTime index.
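A small sketch of the window functions named above, applied after a resample on a DateTime index. The data, the window size of 3, and the span of 3 are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical daily data on a DateTime index.
index = pd.date_range(start="2024-01-01", periods=5, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index)

# Resample first so the rows are regularly spaced (a no-op for this data).
daily = s.resample("D").mean()

# Each window object must be aggregated, just like a GroupBy object.
rolling_mean = daily.rolling(window=3).mean()  # mean of the last 3 rows
expanding_sum = daily.expanding().sum()        # running total
ewm_mean = daily.ewm(span=3).mean()            # exponentially weighted mean
```

The first two entries of rolling_mean are NaN because a full window of three rows is not yet available at those positions.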
Take your Pandas skills to the next level! Register at www.enthought.com/pandas-mastery-workshop
© 2019 Enthought, Inc., licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/
Reshaping DataFrames and Pivot Tables
Tools for reshaping DataFrames from the wide to the long format and back.
The long format can be tidy, which means that "each variable is a column,
each observation is a row" [1]. Tidy data is easier to filter, aggregate,
transform, sort, and pivot. Reshaping operations often produce multi-level
indices or columns, which can be sliced and indexed.

[1] Hadley Wickham (2014) "Tidy Data", http://dx.doi.org/10.18637/jss.v059.i10

Long to Wide Format and Back with stack() and unstack()

df.pivot() vs pd.pivot_table
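As a hedged illustration of the two headings above, the sketch below moves a small long-format table to wide format and back, then contrasts df.pivot() with pd.pivot_table on data that contains a duplicate entry. The column names (date, var, value) and the data are hypothetical.

```python
import pandas as pd

# Hypothetical tidy (long) data: one observation per row.
long_df = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "var": ["X", "Y", "X", "Y"],
    "value": [1, 2, 3, 4],
})

# Long to wide: one row per date, one column per variable.
wide = long_df.pivot(index="date", columns="var", values="value")

# Wide to long and back again with stack() and unstack().
stacked = wide.stack()          # Series indexed by (date, var)
round_trip = stacked.unstack()  # wide format again

# Add a second observation for the pair (d1, X).
extra = pd.DataFrame({"date": ["d1"], "var": ["X"], "value": [5]})
with_duplicate = pd.concat([long_df, extra], ignore_index=True)

# pivot() cannot reshape duplicate (index, column) pairs.
try:
    with_duplicate.pivot(index="date", columns="var", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here with the mean).
table = pd.pivot_table(
    with_duplicate, index="date", columns="var", values="value", aggfunc="mean"
)
```

In this sketch pivot() is the plain reshape for unique index/column pairs, while pivot_table() is the choice when several rows map to the same cell and need aggregating.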