
Data Analysis with PANDAS
CHEAT SHEET
Created By: Arianne Colton and Sean Chen

Data Structures

SERIES (1D)
One-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its "index". If an index is not specified, a default one consisting of the integers 0 through N-1 is created.

Create Series : series1 = pd.Series([1, 2], index = ['a', 'b'])
                series1 = pd.Series(dict1) *
Get Series Values : series1.values
Get Values by Index : series1['a']
                      series1[['b', 'a']]
Get Series Index : series1.index
Get Name Attribute (None is default) : series1.name
                                       series1.index.name
** Common Index Values are Added : series1 + series2
Unique But Unsorted : series2 = series1.unique()

* Can think of a Series as a fixed-length, ordered dict. A Series can be substituted into many functions that expect a dict.
** Differently-indexed data is auto-aligned in arithmetic operations.
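A minimal sketch of that auto-alignment (the labels 'a', 'b', 'c' are made up for illustration) :

import pandas as pd

s1 = pd.Series([1, 2], index = ['a', 'b'])
s2 = pd.Series({'b': 10, 'c': 20})   # dict keys become the index
print(s1 + s2)                       # only 'b' overlaps -> 12; 'a' and 'c' become NaN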
DATAFRAME (2D)
Tabular data structure with an ordered collection of columns, each of which can be a different value type. A DataFrame (DF) can be thought of as a dict of Series.

Create DF (from a dict of equal-length lists or NumPy arrays) :
    dict1 = {'state': ['Ohio', 'CA'], 'year': [2000, 2010]}
    df1 = pd.DataFrame(dict1)   # columns are placed in sorted order
    df1 = pd.DataFrame(dict1, index = ['row1', 'row2'])   # specifying index
    df1 = pd.DataFrame(dict1, columns = ['year', 'state'])   # columns are placed in your given order
* Create DF (from nested dict of dicts), the inner keys as row indices :
    dict1 = {'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 3, 'row2': 4}}
    df1 = pd.DataFrame(dict1)
Get Columns and Row Names : df1.columns
                            df1.index
Get Name Attribute (None is default) : df1.columns.name
                                       df1.index.name
Get Values : df1.values   # returns the data as a 2D ndarray; the dtype will be chosen to accommodate all of the columns
** Get Column as Series : df1['state'] or df1.state
** Get Row as Series : df1.ix['row2'] or df1.ix[1]
Assign a column that doesn't exist (creates a new column) : df1['eastern'] = df1.state == 'Ohio'
Delete a column : del df1['eastern']
Switch Columns and Rows : df1.T

* Dicts of Series are treated the same as nested dicts of dicts.
** Data returned is a 'view' on the underlying data, NOT a copy. Thus, any in-place modifications to the data will be reflected in df1.
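As a quick, self-contained illustration (the 'state'/'year' columns mirror the example above), building a DF, adding a derived column and pulling a column back out as a Series :

import pandas as pd

dict1 = {'state': ['Ohio', 'CA'], 'year': [2000, 2010]}
df1 = pd.DataFrame(dict1, index = ['row1', 'row2'])
df1['eastern'] = df1['state'] == 'Ohio'   # assigning a non-existent column creates it
print(df1)
print(type(df1['state']))                 # a single column comes back as a Series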
PANEL DATA (3D)
Create Panel Data (each item in the Panel is a DF) :
    import pandas_datareader.data as web
    panel1 = pd.Panel({stk : web.get_data_yahoo(stk, '1/1/2000', '1/1/2010')
                       for stk in ['AAPL', 'IBM']})
    # panel1 dimensions : 2 (items) x 861 (major) x 6 (minor)

"Stacked" DF form (useful way to represent panel data) :
    panel1 = panel1.swapaxes('items', 'minor')
    panel1.ix[:, '6/1/2003', :].to_frame() *
    => Stacked DF (with hierarchical indexing **) : columns Open, High, Low, Close, Volume, Adj-Close, with a (major = date, minor = stock) row index such as 2003-06-01/AAPL, 2003-06-01/IBM, 2003-06-02/AAPL, 2003-06-02/IBM, ...

* DF has a "to_panel()" method which is the inverse of "to_frame()".
** Hierarchical indexing makes N-dimensional arrays unnecessary in a lot of cases. Aka prefer to use a Stacked DF, not Panel data.

INDEX OBJECTS
Immutable objects that hold the axis labels and other metadata (i.e. axis name).
• i.e. Index, MultiIndex, DatetimeIndex, PeriodIndex
• Any sequence of labels used when constructing a Series or DF is internally converted to an Index.
• Can function as a fixed-size set in addition to being array-like.

HIERARCHICAL INDEXING
Multiple index levels on an axis : a way to work with higher-dimensional data in a lower-dimensional form.

MultiIndex :
    series1 = Series(np.random.randn(6), index = [['a', 'a', 'a', 'b', 'b', 'b'],
                                                  [1, 2, 3, 1, 2, 3]])
    series1.index.names = ['key1', 'key2']
Partial Indexing : series1['b']    # outer level
                   series1[:, 2]   # inner level
DF Partial Indexing : df1['outerCol3', 'InnerCol2'] or df1['outerCol3']['InnerCol2']

Swapping and Sorting Levels
Swap Level (levels interchanged) * : swapSeries1 = series1.swaplevel('key1', 'key2')
Sort Level : series1.sortlevel(1)   # sorts according to the first inner level
Swap and Sort ** : series1.swaplevel(0, 1).sortlevel(0)   # the order of the rows also changes

* The order of the rows does not change; only the two levels get swapped.
** Data selection performance is much better if the index is sorted starting with the outermost level, as a result of calling sortlevel(0) or sort_index().

Summary Statistics by Level
Most stats functions in a DF or Series have a "level" option with which you can specify the level you want on an axis.
Sum rows (that have the same 'key2' value) : df1.sum(level = 'key2')
Sum columns : df1.sum(level = 'col3', axis = 1)
• Under the hood, this functionality utilizes pandas' "groupby".

DataFrame's Columns as Indexes
DF's "set_index" will create a new DF using one or more of its columns as the index.
New DF using columns as index : df2 = df1.set_index(['col3', 'col4']) * ‡
    # col3 becomes the outermost index, col4 becomes the inner index.
    # The values of col3 and col4 become the index values.

* "reset_index" does the opposite of "set_index" : the hierarchical index levels are moved into columns.
‡ By default, 'col3' and 'col4' will be removed from the DF, though you can keep them with the option drop = False.
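A small sketch of set_index plus a level-based summary; the column names are made up, and groupby(level = ...) is used as the equivalent of the sum(level = ...) form shown above :

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col3': ['a', 'a', 'b', 'b'], 'col4': [1, 2, 1, 2],
                    'val': np.arange(4.0)})
df2 = df1.set_index(['col3', 'col4'])       # col3 = outer level, col4 = inner level
print(df2.loc['a'])                         # partial indexing on the outer level
print(df2.groupby(level = 'col4').sum())    # what sum(level = 'col4') does under the hood
print(df2.reset_index().head())             # move the index levels back into columns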
Missing Data

Python NaN : np.nan (not a number). NaN or the Python built-in None mean missing/NA values in pandas.
* Use pd.isnull(), pd.notnull() or series1/df1.isnull() to detect missing data.

FILTERING OUT MISSING DATA
dropna() returns ONLY the non-null data; the source data is NOT modified.
df1.dropna()              # drop any row containing a missing value
df1.dropna(axis = 1)      # drop any column containing missing values
df1.dropna(how = 'all')   # drop rows that are all missing
df1.dropna(thresh = 3)    # drop any row containing < 3 observations

FILLING IN MISSING DATA
df2 = df1.fillna(0)               # fill all missing data with 0
df1.fillna(0, inplace = True)     # modify in place
Use a different fill value for each column : df1.fillna({'col1' : 0, 'col2' : -1})
Only forward-fill the first 2 missing values : df1.fillna(method = 'ffill', limit = 2)
    # i.e. for column1, if rows 3-6 are missing, rows 3 and 4 get filled with the value from row 2, NOT rows 5 and 6.
Essential Functionality

INDEXING (SLICING/SUBSETTING) †
Same as 'NdArray' : in INDEXING, a 'view' of the source array is returned.

† Note : the endpoint is inclusive in pandas slicing with labels (series1['a':'c']), whereas Python slicing is NOT. Pandas non-label (i.e. integer) slicing is still non-inclusive.

Index by Column(s) : df1['col1']
                     df1[['col1', 'col3']]
Index by Row(s) : df1.ix['row1']
                  df1.ix[['row1', 'row3']]
Index by Both Column(s) and Row(s) : df1.ix[['row2', 'row1'], 'col3']
Boolean Indexing : df1[[True, False]]
                   df1[df1['col2'] > 6]   # returns the rows whose col2 value is > 6
• Note that df1['col2'] > 6 returns a boolean Series; each True/False value determines whether the respective row appears in the result.
• Note : avoid integer indexing since it might introduce subtle bugs (e.g. series1[-1]). If you have to use position-based indexing, use "iget_value()" from Series and "irow/icol()" from DF instead of integer indexing.

DROPPING ROWS/COLUMNS
The drop operation returns a new object (i.e. DF) :
Remove Row(s) (axis = 0 is default) : df1.drop('row1')
                                      df1.drop(['row1', 'row3'])
Remove Column(s) : df1.drop('col2', axis = 1)

REINDEXING
Create a new object with the data rearranged to conform to a new index, introducing missing values if any index values were not already present.
Change df1 date index values to the new index values (reindex default is the row index) :
    date_index = pd.date_range('01/23/2010', periods = 10, freq = 'D')
    df1.reindex(date_index)
Replace Missing Values (NaN) with 0 : df1.reindex(date_index, fill_value = 0)
ReIndex Columns : df1.reindex(columns = ['a', 'b'])
ReIndex Both Rows and Columns : df1.reindex(index = [..], columns = [..])
Succinct ReIndex : df1.ix[[..], [..]]

ARITHMETIC AND DATA ALIGNMENT
• df1 + df2 : for indices that don't overlap, internal data alignment introduces NaN.
1. Instead of NaN, substitute 0 for the indices not found in the other DF : df1.add(df2, fill_value = 0)
2. Useful operation : df1 - df1.ix[0]   # subtract every row in df1 by the first row

SORTING AND RANKING
Sort Index/Column †
• sort_index() returns a new, sorted object. Default is "ascending = True".
• The row index is sorted by default; "axis = 1" is used for sorting columns.
† Sorting the index/column means sorting the row/column labels, not sorting the data.

Sort Data
* Missing values (np.nan) are sorted to the end of the Series by default.
Series Sorting : sortedS1 = series1.order()
                 series1.sort()   # in-place sort
DF Sorting : df1.sort_index(by = ['col2', 'col1'])   # sort by col2 first, then col1

Ranking
Rank ties are broken by assigning each tie-group the mean rank (e.g. two values tied for 5th place both get rank 5.5).
Output Rank of Each Element (rank starts from 1) : series1.rank()
                                                   df1.rank(axis = 1)   # rank each row's values

FUNCTION APPLICATIONS
NumPy works fine with pandas objects : np.abs(df1)
Applying a Function to Each Column or Row (default is to apply to each column : axis = 0) :
    f = lambda x: x.max() - x.min()   # return a scalar value
    def f(x): return Series([x.max(), x.min()])   # return multiple values
    df1.apply(f)
Applying a Function Element-Wise :
    f = lambda x: '%.2f' % x
    df1.applymap(f)   # format each entry to 2 decimals

UNIQUE, COUNTS
• It is NOT mandatory for index labels to be unique, although many functions require it. Check via : series1/df1.index.is_unique
• pd.value_counts() returns value frequencies.
Data Aggregation and Group Operations

Categorizing a data set and applying a function to each group, whether an aggregation or a transformation.

Note : for aggregation of "time series" data, please see the Time Series section. A special use case of "groupby" is used there, called "resampling".

GROUPBY (SPLIT-APPLY-COMBINE)  - similar to SQL groupby
Compute Group Mean : df1.groupby('col2').mean()
GroupBy More Than One Key : df1.groupby([df1['col2'], df1['col3']]).mean()
    # results in a hierarchical index consisting of the unique pairs of keys
"GroupBy" Object (ONLY computes intermediate data about the group key df1['col2']) :
    grouped = df1['col1'].groupby(df1['col2'])
    grouped.mean()   # gets the mean of each group formed by 'col2'
Indexing a "GroupBy" Object (select 'col1' for aggregation) :
    df1.groupby('col2')['col1']  or  df1['col1'].groupby(df1['col2'])
• Any missing values in the group key are excluded from the result.

1. Iterating over a GroupBy object
A "GroupBy" object supports iteration : it generates a sequence of 2-tuples containing the group name along with the chunk of data.
    for name, groupdata in df1.groupby('col2'):
        # name is a single value; groupdata is the filtered DF containing only the data matching that value
    for (k1, k2), groupdata in df1.groupby(['col2', 'col3']):
        # if grouping by multiple keys, the first element of the tuple is a tuple of key values
Convert Groups to Dict : dict(list(df1.groupby('col2')))   # col2's unique values become the keys of the dict
Group Columns by "dtype" : grouped = df1.groupby(df1.dtypes, axis = 1)
                           dict(list(grouped))   # separates the data into different types

2. Grouping with functions
Any function passed as a group key will be called once per (default is row index) value, with the return values being used as the group names. (This assumes the row index is named.)
    df1.groupby(len).sum()
    # returns a DF whose row index is the length of the names; names of the same length have their values summed. Column names are retained.

DATA AGGREGATION
Data aggregation means any data transformation that produces scalar values from arrays, such as "mean", "max", etc.
Use a Self-Defined Function : def func1(array): ...
                              grouped.agg(func1)
Get DF with Column Names as Function Names : grouped.agg([mean, std])
Get DF with Self-Defined Column Names : grouped.agg([('col1', mean), ('col2', std)])
Use a Different Function Depending on the Column : grouped.agg({'col1' : [min, max], 'col3' : sum})

GROUP-WISE OPERATIONS AND TRANSFORMATIONS
• agg() is a special case of data transformation, aka reducing a one-dimensional array to a scalar.
• transform() is a specialized data transformation : it applies a function to each group; if the function produces a scalar value, that value is placed in every row of the group. Thus, if the DF has 10 rows, after "transform()" there will still be 10 rows, each holding the scalar value from its respective group. The passed function must either produce a scalar value or a transformed array of the same size.
• General-purpose transformation : apply()
    df1.groupby('col2').apply(your_func1)
    # your function ONLY needs to return a pandas object or a scalar

# Example 1 : Yearly Correlations with SPX
# "close_price" is a DF with stock and SPX closing-price columns and a date index
returns = close_price.pct_change().dropna()
by_year = returns.groupby(lambda x : x.year)
spx_corr = lambda x : x.corrwith(x['SPX'])
by_year.apply(spx_corr)

# Example 2 : Exploratory Regression
import statsmodels.api as sm
def regress(data, y, x):
    Y = data[y]; X = data[x]
    X['intercept'] = 1
    result = sm.OLS(Y, X).fit()
    return result.params
by_year.apply(regress, 'AAPL', ['SPX'])
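To make the agg/transform/apply distinction concrete, a minimal sketch with made-up data :

import pandas as pd

df1 = pd.DataFrame({'col2': ['x', 'x', 'y', 'y'], 'col1': [1.0, 3.0, 2.0, 6.0]})
grouped = df1.groupby('col2')['col1']
print(grouped.agg(['mean', 'max']))            # one row per group (reduction)
print(grouped.transform('mean'))               # group mean broadcast back to all 4 rows
print(grouped.apply(lambda x: x / x.sum()))    # general-purpose : any pandas object or scalar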
Data Wrangling : Merge, Reshape, Clean, Transform

COMBINING AND MERGING DATA

1. pd.merge() aka database "join" : connects rows in DFs based on one or more keys.
• Merge via Column (Common)
df3 = pd.merge(df1, df2, on = 'col2') *
# INNER join is the default. Or use the option : how = 'outer/left/right'
# the indexes of df1 and df2 are discarded in df3
* ALL overlapping column names are used as the keys to merge. Good practice is to specify the keys : on = ['col2', 'col3'].
  If the key has a different name in df1 and df2, use the options : left_on = 'lkey', right_on = 'rkey'.
• Merge via Row (Uncommon)
df3 = pd.merge(df1, df2, left_index = True, right_index = True)
# Uses the indexes as the merge key : aka rows with the same index value are joined together.

2. pd.concat() : glues or stacks objects along an axis (default is along rows : axis = 0).
df3 = pd.concat([df1, df2], ignore_index = True)
# ignore_index = True : discard the original indexes in df3
# If df1 has 2 rows and df2 has 3 rows, then df3 has 5 rows.

3. combine_first() : combines data with overlap, patching missing values.
df3 = df1.combine_first(df2)
# df1 and df2 indexes overlap in full or in part. If a row does NOT exist in df1 but does in df2, it will be in df3. If row1 of df1 and row3 of df2 have the same index value, but row1's col3 value is NA, df3 gets this row with the col3 data from df2.
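A tiny end-to-end sketch of merge and concat (the DF contents are invented for illustration) :

import pandas as pd

df1 = pd.DataFrame({'col2': ['a', 'b'], 'left_val': [1, 2]})
df2 = pd.DataFrame({'col2': ['b', 'c'], 'right_val': [3, 4]})
print(pd.merge(df1, df2, on = 'col2'))                 # inner join keeps only 'b'
print(pd.merge(df1, df2, on = 'col2', how = 'outer'))  # outer join keeps 'a', 'b', 'c' with NaNs
print(pd.concat([df1, df2], ignore_index = True))      # stacks the rows with a fresh integer index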
RESHAPING AND PIVOTING

1. Reshaping with Hierarchical Indexing
series1 = df1.stack()
# Rotates (innermost level *) columns to rows as the innermost index level, resulting in a Series with a hierarchical index.
df1 = series1.unstack()
# Rotates (innermost level *) rows to columns as the innermost column level.
* Default is to stack/unstack the innermost level. If you want a different level, pass it, i.e. stack(level = 0) - the outermost level.
* Note : unstacking might introduce missing data if not all of the values in the level are found in each of the subgroups. Stacking filters out missing data by default, i.e. data.unstack().stack().

2. Pivoting
• A common format for storing multiple "time series" in databases and CSV is the long/stacked format : "date, stock_name, price".
• However, a DF with these 3 columns is difficult to work with. Thus, the "wide" format is preferred : 'date' as the row index, 'stock_name' as the columns, 'price' as the DF data values.
pivotedDf2 = df1.pivot('date', 'stock_name', 'price')
# Example pivotedDf2 :
#              AAPL   IBM    JD
# 2003-06-01  120.2  100.1  20.8
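A short sketch of the long-to-wide pivot and the stack/unstack round trip; the prices are made up, and keyword arguments are used so it runs on both older and newer pandas :

import pandas as pd

long_df = pd.DataFrame({'date': ['2003-06-01', '2003-06-01', '2003-06-02', '2003-06-02'],
                        'stock_name': ['AAPL', 'IBM', 'AAPL', 'IBM'],
                        'price': [120.2, 100.1, 121.0, 99.5]})
wide = long_df.pivot(index = 'date', columns = 'stock_name', values = 'price')
print(wide)                # one row per date, one column per stock
stacked = wide.stack()     # back to a Series with a (date, stock_name) hierarchical index
print(stacked.unstack())   # and back to the wide form again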
COMMON OPERATIONS

1. Removing Duplicate Rows
series1 = df1.duplicated()     # boolean Series indicating whether each row is a duplicate or not
df2 = df1.drop_duplicates()    # duplicates have been dropped in df2

2. Add New Column Based On Value of Column(s)
df1['newCol'] = df1['col2'].map(dict1)   # maps each col2 value as dict1's key and gets dict1's value
df1['newCol'] = df1['col2'].map(func1)   # applies a function to each col2 value

3. Replacing Values
# Replace is NOT in-place
df2 = df1.replace(np.nan, 100)
# Replace multiple values at once
df2 = df1.replace([-1, np.nan], 100)
df2 = df1.replace([-1, np.nan], [1, 2])
# The argument can be a dict as well
df2 = df1.replace({-1 : 1, np.nan : 2})

4. Renaming Axis Indexes
Convert Index to Upper Case : df1.index = df1.index.map(str.upper)
Rename 'row1' to 'newRow1' : df2 = df1.rename(index = {'row1' : 'newRow1'}, columns = str.upper)
# Optionally inplace = True

5. Discretization and Binning
• Continuous data is often discretized into "bins" for analysis.
# Divide data into 2 bins of numbers : (18, 26], (26, 35]
# ']' means inclusive, ')' is NOT inclusive
bins = [18, 26, 35]
cat = pd.cut(array1, bins, labels = [..])   # cat is a "Categorical" object
pd.value_counts(cat)
cat = pd.cut(array1, numofBins)    # compute equal-length bins based on the min and max values in array1
cat = pd.qcut(array1, numofBins)   # bin the data based on sample quantiles - roughly equal-size bins

6. Detecting and Filtering Outliers
• any() tests along an axis whether any element is "True". Default is to test along columns (axis = 0).
df1[(np.abs(df1) > 3).any(axis = 1)]
# Select all rows having a value > 3 or < -3.
# Another useful function : np.sign() returns 1 or -1.

7. Permutation and Random Sampling
randomOrder = np.random.permutation(df1.shape[0])
df2 = df1.take(randomOrder)

8. Computing Indicator/Dummy Variables
• If a column in a DF has "K" distinct values, derive an "indicator" DF containing K columns of 0s and 1s. 1 means the value exists, 0 means it does NOT.
dummyDf = pd.get_dummies(df1['col2'], prefix = 'col-')   # add a prefix to the K column names
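A minimal sketch of binning plus indicator variables (the ages are invented) :

import pandas as pd

ages = pd.Series([20, 27, 31, 25])
cat = pd.cut(ages, bins = [18, 26, 35])      # two bins : (18, 26] and (26, 35]
print(pd.value_counts(cat))                  # counts per bin
print(pd.get_dummies(cat, prefix = 'age'))   # one indicator column per bin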

Getting Data

TEXT FORMAT (CSV)
df1 = pd.read_csv(file/URL/file-like-object, sep = ',', header = None)
# Type inference : you do NOT have to specify which columns are numeric, integer, boolean or string.
# In pandas, missing data in the source data is usually an empty string, NA, -1, #IND or NULL. You can specify missing values via the option : na_values = ['NULL'].
# Default delimiter is a comma. Default is that the first row is the column header.
df1 = pd.read_csv(.., names = [..])
# Explicitly specify the column header; this also implies that the first row is data.
df1 = pd.read_csv(.., names = [.., 'date'], index_col = 'date')
# Make the 'date' column the row index of the returned DF.
df1.to_csv(filepath/sys.stdout, sep = ',')
# Missing values appear as empty strings in the output. Thus, it is better to add the option : na_rep = 'NULL'.
# Default is that row and column labels are written. Disable via the options : index = False, header = False.

JSON (JAVASCRIPT OBJECT NOTATION) DATA
One of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than tabular text like CSV.
Convert JSON string to Python form : data = json.loads(jsonObj)
Convert Python object to JSON : asJson = json.dumps(data)
Create DF from JSON : pd.DataFrame(data['name'], columns = ['field1'])

XML AND HTML DATA
HTML :
doc = lxml.html.parse(urlopen('http://..')).getroot()
tables = doc.findall('.//table')
rows = tables[1].findall('.//tr')
XML :
lxml.objectify.parse(open(filepath)).getroot()
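A self-contained read_csv/to_csv sketch using an in-memory file; io.StringIO stands in for a real path and the column names are invented :

import io
import pandas as pd

csv_text = "date,state,value\n2010-01-23,Ohio,1.5\n2010-01-24,CA,NULL\n"
df1 = pd.read_csv(io.StringIO(csv_text), na_values = ['NULL'],
                  parse_dates = True, index_col = 'date')
print(df1)                       # 'NULL' became NaN, 'date' became a DatetimeIndex
out = io.StringIO()
df1.to_csv(out, na_rep = 'NULL')
print(out.getvalue())            # missing values written back out as 'NULL'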
Descriptive Statistics Methods †
Compared with the equivalent methods of ndArray, descriptive statistics methods in pandas are built from the ground up to exclude missing data.
† NA (i.e. NaN) values are excluded. This can be disabled using the option "skipna = False".

Column Sums (use axis = 1 to sum over rows) : series1 = df1.sum()
Return the Index Labels Where the Min/Max Values are Attained : df1.idxmin() or df1.idxmax()
Multiple Summary Statistics (i.e. count, mean, std) : df1.describe()
# On non-numeric data, describe() gives alternate statistics (i.e. count, unique).

CORRELATION AND COVARIANCE
• cov(), corr()
• corrwith() - pairwise correlations : aka compute a DF with a Series. If the input is not a Series but another DF, it computes the correlations of matching column names, i.e. returns.corrwith(volumes).

# Example : Correlation
import pandas_datareader.data as web
data = {}
for ticker in ['AAPL', 'JD']:
    data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')
prices = pd.DataFrame({ticker : d['Adj Close'] for ticker, d in data.items()})
volumes = ...
returns = prices.pct_change()
returns.AAPL.corr(returns.JD)
# Series corr() computes the correlation of overlapping, non-NA, aligned-by-index values in two Series.
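A tiny offline sketch of these methods (the values are invented), so it runs without a data feed :

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'AAPL': [1.0, 2.0, np.nan, 4.0],
                    'IBM':  [2.0, 1.0, 3.0, 5.0]})
print(df1.sum())                        # column sums; NaN is skipped by default
print(df1.idxmax())                     # index label where each column's max occurs
print(df1.describe())                   # count / mean / std / quartiles per column
print(df1['AAPL'].corr(df1['IBM']))     # correlation over overlapping, non-NA rows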
Time Series

• Python standard library data types for date and time : "datetime", "time", "calendar". †
• Pandas data type for date and time : "Timestamp". *
† "datetime" is widely used; it stores both the date and the time down to the microsecond.
* A "Timestamp" object can be substituted anywhere you would use a "datetime" object.

Convert String to DateTime :
    from datetime import datetime
    datetime.strptime('8/8/2008', '%m/%d/%Y')
Get Time Now : now = datetime.now()
DateTime Arithmetic :
    from datetime import timedelta
    datetime(2011, 1, 8) + timedelta(12)  => 2011-01-20
    # A timedelta represents the temporal difference between two datetime objects.
Convert Strings to Pandas Timestamps : timestamps = pd.to_datetime(['8/8/2008', ..])
# NaT (Not a Time) is the pandas NA value for Timestamp data :
#   pd.to_datetime('') => NaT    # missing value (i.e. empty string)
#   pd.isnull(NaT) => True

PANDA TIME SERIES

Create Time Series :
    ts1 = pd.Series(np.random.randn(8), index = [ datetime(2011, 1, 2), .. ])
    ts1 = pd.Series(..., index = pd.date_range('1/1/2000', periods = 1000))
    # ts1.index is the pandas "DatetimeIndex" class
† The index value ts1.index[0] is a pandas "Timestamp" object, which stores the timestamp using NumPy's "datetime64" type at nanosecond resolution. Further, the Timestamp class stores the frequency information as well as the timezone.
    ts1.index.dtype  => datetime64[ns]

Indexing (Slicing/Subsetting)
The argument can be a string date, a datetime or a Timestamp.
Select Year 2001 : ts1['2001']
                   df1.ix['2001']
Select June 2001 : ts1['2001-06']
Select From 2001-01-01 to 2001-08-01 : ts1['1/1/2001':'8/1/2001']
Select From 2001-01-08 On : ts1[datetime(2001, 1, 8):]

Common Operations
Get Time Series Data Before 2011-01-09 : ts1.truncate(after = '1/8/2011')
DATE RANGES, FREQUENCIES AND SHIFTING

† Generic time series in pandas are assumed to be irregular, aka to have no fixed frequency. However, we often prefer to work with a fixed frequency, i.e. daily, monthly, etc.
Take a look at the "Resampling" section :
    ts1.resample('D', how = ..)
    # Convert to a fixed daily frequency. Introduces missing values (NaN) if needed.

1. Frequencies and Date Offsets
• Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly.
    freq = '4H'
    freq = '1h30min'
    # Standard US equity option monthly expiration, every third Friday of the month : freq = 'WOM-3FRI'

2. Generating Date Ranges
Default Frequency is Daily : pd.date_range(begin, end)  or  pd.date_range(begin or end, periods = n)
# The option freq = 'BM' means the last business day at the end of the month.

3. Shifting (Leading and Lagging) Data
• Shifting refers to moving data backward and forward through time.
• Series and DF "shift()" does a naive shift, aka the index does not shift, only the values. *
    ts1.shift(1)   # ts1 is daily data : move yesterday's value to today, today's value to tomorrow, etc.
Shift Data By 3 Days (ts1 is any time series data) :
    ts1.shift(3, freq = 'D')  or  ts1.shift(1, freq = '3D')
# Common use of shift : to compute the % change
    ts1 / ts1.shift(1) - 1
* In the result returned from shift(), some data values might be NaN.
• Other ways to shift data :
    from pandas.tseries.offsets import Day, MonthEnd
    datetime(2008, 8, 8) + 3*Day()      => 2008-08-11
    datetime(2008, 8, 8) + MonthEnd(2)  => 2008-09-30
    MonthEnd().rollforward(datetime(2008, 8, 8))  => 2008-08-31
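A small sketch of a naive shift versus an index shift, plus the shift-based % change (the values are invented) :

import pandas as pd

ts1 = pd.Series([1.0, 2.0, 3.0, 4.0], index = pd.date_range('1/1/2000', periods = 4))
print(ts1 / ts1.shift(1) - 1)       # day-over-day % change; the first value is NaN
print(ts1.shift(1, freq = '3D'))    # the index moves 3 days forward, values stay attached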
TIME ZONE HANDLING
• Daylight saving time (DST) transitions are a common source of complication.
• UTC is the current international standard. Time zones are expressed as offsets from UTC. * NY is 4 hours behind UTC during daylight saving time and 5 hours the rest of the year.

1. Python Time Zones (from the 3rd-party pytz library)
Get List of Timezone Names : pytz.common_timezones
Get a Timezone Object : pytz.timezone('US/Eastern')

2. Localization and Conversion
Time Series is Time Zone Naive by Default : ts1.index.tz  => None
Specify a Time Zone When Creating a Time Series : use the option tz = 'UTC' in pd.date_range()
Localization From Naive : ts1_utc = ts1.tz_localize('UTC')
Convert to Another Time Zone Once the Time Series Has Been Localized : ts1_eastern = ts1_utc.tz_convert('US/Eastern')

3. ** Time Zone-aware Timestamp Objects
stamp_utc = pd.Timestamp('2008-08-08 03:00', tz = 'UTC')
stamp_eastern = stamp_utc.tz_convert(...)
Pandas time arithmetic respects daylight saving time transitions :
    stamp = pd.Timestamp('2012-11-04 00:30', tz = 'US/Eastern')  => 2012-11-04 00:30:00-0400 EDT
    stamp + 2 * Hour()  => 2012-11-04 01:30:00-0500 EST
** If two time series with different time zones are combined, i.e. ts1 + ts2, the timestamps will auto-align with respect to time zone. The result will be in UTC.

RESAMPLING
The process of converting a time series from one frequency to another :

1. Downsampling - aggregating higher-frequency data to a lower frequency.
ts1.resample('M', how = 'mean')
    => Index : 2000-01-31, 2000-02-29, ...
ts1.resample('M', ..., kind = 'period')   # 'period' - use a time-span representation
    => Index : 2000-01, 2000-02, ...
# ts1 is one-minute data with values 1 to 100 at times 00:00:00, 00:01:00, ...
ts1.resample('5min', how = 'sum')  =>
    00:00:00    15   (aka : 1 + 2 + 3 + 4 + 5)
    00:05:00    40
# Default is that the left bin edge is inclusive, thus the 00:00:00 value is included in the 00:00:00 to 00:05:00 interval.
# The option closed = 'right' changes the interval to right-inclusive. Also include the option label = 'right' as well :
    00:00:00     1
    00:05:00    20   (aka : 2 + 3 + 4 + 5 + 6)
ts1.resample('5min', how = 'ohlc')   # returns a DF with 4 columns - open, high, low, close
* Alternate way to downsample : ts1.groupby(lambda x : x.month).mean()

2. Upsampling and Interpolation * - interpolate a low frequency to a higher frequency. By default missing values (NaN) are introduced.
df1.resample('D', fill_method = 'ffill')
# Forward fills values : i.e. a missing-value index such as index 3 will copy the value from index 2.
* Interpolation will ONLY apply row-wise.
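A runnable downsampling sketch; the data is a short made-up version of the example above, and the .sum()/.ohlc() method chain is the newer spelling of the how = .. keyword used in this sheet :

import numpy as np
import pandas as pd

ts1 = pd.Series(np.arange(1, 11),
                index = pd.date_range('1/1/2000 00:00', periods = 10, freq = 'min'))
print(ts1.resample('5min').sum())                                     # 00:00 -> 15, 00:05 -> 40
print(ts1.resample('5min', closed = 'right', label = 'right').sum())  # right-inclusive bins, right labels
print(ts1.resample('5min').ohlc())                                    # open / high / low / close columns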
TIME SERIES PLOTTING

# Example : the source data's first column is the date. Use the first column as the index, then parse the index values as dates.
prices = pd.read_csv(.., parse_dates = True, index_col = 0)
px = prices[['AAPL', 'IBM']]
px = px.resample('B', fill_method = 'ffill')
px['AAPL'].plot()
px['AAPL'].ix['01-2008':'03-2012'].plot()
px.ix['2008'].plot()

MOVING WINDOW FUNCTIONS
Like other statistical functions, these functions automatically exclude missing data.
pd.rolling_mean(px.AAPL, 200).plot()
pd.rolling_std(px.AAPL.pct_change(), 22, min_periods = 20).plot()
pd.rolling_corr(px.AAPL.pct_change(), px.IBM.pct_change(), 22).plot()
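A self-generated moving-window sketch, using the newer .rolling() spelling of the pd.rolling_* helpers above (the price path is random) :

import numpy as np
import pandas as pd

px = pd.Series(np.random.randn(300).cumsum() + 100,
               index = pd.date_range('1/1/2000', periods = 300, freq = 'B'))
ma = px.rolling(50).mean()                                   # like pd.rolling_mean(px, 50)
vol = px.pct_change().rolling(22, min_periods = 20).std()    # like pd.rolling_std(.., 22, ..)
print(ma.tail())
print(vol.tail())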
PERFORMANCE
• Since "Timestamps" are represented as 64-bit integers using NumPy's datetime64 type, each data point has an associated 8 bytes of memory per timestamp.
• "Creating views" on an existing time series or DF does not cause any more memory to be used.
• Indexes for lower frequencies (daily and up) are stored in a central cache, so any fixed-frequency index is a view on the date cache. Thus, the memory footprint of low-frequency indexes is not significant.
• Performance-wise, pandas has been highly optimized for data alignment operations (i.e. ts1 + ts2) and resampling.

Created by Arianne Colton and Sean Chen
www.datasciencefree.com
Based on content from "Python for Data Analysis" by Wes McKinney
Updated: August 22, 2016
