
Data Analysis with PANDAS
CHEAT SHEET
Created By: Arianne Colton and Sean Chen

Data Structures

SERIES (1D)
One-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its "index". If an index is not specified, a default one consisting of the integers 0 through N-1 is created.

Create Series : series1 = pd.Series([1, 2], index = ['a', 'b'])
                series1 = pd.Series(dict1) *
Get Series Values : series1.values
Get Values by Index : series1['a']
                      series1[['b', 'a']]
Get Series Index : series1.index
Get Name Attribute (None is default) : series1.name
                                       series1.index.name
** Common Index Values are Added : series1 + series2
Unique But Unsorted : series2 = series1.unique()

* Can think of a Series as a fixed-length, ordered dict. A Series can be substituted into many functions that expect a dict.
** Differently-indexed data is auto-aligned in arithmetic operations.
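A minimal sketch of that auto-alignment (the labels 'a', 'b', 'c' are made up for illustration) :

import pandas as pd

s1 = pd.Series([1, 2], index = ['a', 'b'])
s2 = pd.Series({'b': 10, 'c': 20})   # dict keys become the index
print(s1 + s2)                       # only 'b' overlaps -> 12; 'a' and 'c' become NaN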
DATAFRAME (2D)
Tabular data structure with an ordered collection of columns, each of which can be a different value type. A DataFrame (DF) can be thought of as a dict of Series.

Create DF (from a dict of equal-length lists or NumPy arrays) :
    dict1 = {'state': ['Ohio', 'CA'], 'year': [2000, 2010]}
    df1 = pd.DataFrame(dict1)   # columns are placed in sorted order
    df1 = pd.DataFrame(dict1, index = ['row1', 'row2'])   # specifying index
    df1 = pd.DataFrame(dict1, columns = ['year', 'state'])   # columns are placed in your given order
* Create DF (from nested dict of dicts), the inner keys as row indices :
    dict1 = {'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 3, 'row2': 4}}
    df1 = pd.DataFrame(dict1)
Get Columns and Row Names : df1.columns
                            df1.index
Get Name Attribute (None is default) : df1.columns.name
                                       df1.index.name
Get Values : df1.values   # returns the data as a 2D ndarray; the dtype will be chosen to accommodate all of the columns
** Get Column as Series : df1['state'] or df1.state
** Get Row as Series : df1.ix['row2'] or df1.ix[1]
Assign a column that doesn't exist (creates a new column) : df1['eastern'] = df1.state == 'Ohio'
Delete a column : del df1['eastern']
Switch Columns and Rows : df1.T

* Dicts of Series are treated the same as nested dicts of dicts.
** Data returned is a 'view' on the underlying data, NOT a copy. Thus, any in-place modifications to the data will be reflected in df1.
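As a quick, self-contained illustration (the 'state'/'year' columns mirror the example above), building a DF, adding a derived column and pulling a column back out as a Series :

import pandas as pd

dict1 = {'state': ['Ohio', 'CA'], 'year': [2000, 2010]}
df1 = pd.DataFrame(dict1, index = ['row1', 'row2'])
df1['eastern'] = df1['state'] == 'Ohio'   # assigning a non-existent column creates it
print(df1)
print(type(df1['state']))                 # a single column comes back as a Series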
PANEL DATA (3D)
Create Panel Data (each item in the Panel is a DF) :
    import pandas_datareader.data as web
    panel1 = pd.Panel({stk : web.get_data_yahoo(stk, '1/1/2000', '1/1/2010')
                       for stk in ['AAPL', 'IBM']})
    # panel1 dimensions : 2 (items) x 861 (major) x 6 (minor)

"Stacked" DF form (useful way to represent panel data) :
    panel1 = panel1.swapaxes('items', 'minor')
    panel1.ix[:, '6/1/2003', :].to_frame() *
    => Stacked DF (with hierarchical indexing **) : columns Open, High, Low, Close, Volume, Adj-Close, with a (major = date, minor = stock) row index such as 2003-06-01/AAPL, 2003-06-01/IBM, 2003-06-02/AAPL, 2003-06-02/IBM, ...

* DF has a "to_panel()" method which is the inverse of "to_frame()".
** Hierarchical indexing makes N-dimensional arrays unnecessary in a lot of cases. Aka prefer to use a Stacked DF, not Panel data.

INDEX OBJECTS
Immutable objects that hold the axis labels and other metadata (i.e. axis name).
• i.e. Index, MultiIndex, DatetimeIndex, PeriodIndex
• Any sequence of labels used when constructing a Series or DF is internally converted to an Index.
• Can function as a fixed-size set in addition to being array-like.

HIERARCHICAL INDEXING
Multiple index levels on an axis : a way to work with higher-dimensional data in a lower-dimensional form.

MultiIndex :
    series1 = Series(np.random.randn(6), index = [['a', 'a', 'a', 'b', 'b', 'b'],
                                                  [1, 2, 3, 1, 2, 3]])
    series1.index.names = ['key1', 'key2']
Partial Indexing : series1['b']    # outer level
                   series1[:, 2]   # inner level
DF Partial Indexing : df1['outerCol3', 'InnerCol2'] or df1['outerCol3']['InnerCol2']

Swapping and Sorting Levels
Swap Level (levels interchanged) * : swapSeries1 = series1.swaplevel('key1', 'key2')
Sort Level : series1.sortlevel(1)   # sorts according to the first inner level
Swap and Sort ** : series1.swaplevel(0, 1).sortlevel(0)   # the order of the rows also changes

* The order of the rows does not change; only the two levels get swapped.
** Data selection performance is much better if the index is sorted starting with the outermost level, as a result of calling sortlevel(0) or sort_index().

Summary Statistics by Level
Most stats functions in a DF or Series have a "level" option with which you can specify the level you want on an axis.
Sum rows (that have the same 'key2' value) : df1.sum(level = 'key2')
Sum columns : df1.sum(level = 'col3', axis = 1)
• Under the hood, this functionality utilizes pandas' "groupby".

DataFrame's Columns as Indexes
DF's "set_index" will create a new DF using one or more of its columns as the index.
New DF using columns as index : df2 = df1.set_index(['col3', 'col4']) * ‡
    # col3 becomes the outermost index, col4 becomes the inner index.
    # The values of col3 and col4 become the index values.

* "reset_index" does the opposite of "set_index" : the hierarchical index levels are moved into columns.
‡ By default, 'col3' and 'col4' will be removed from the DF, though you can keep them with the option drop = False.
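A small sketch of set_index plus a level-based summary; the column names are made up, and groupby(level = ...) is used as the equivalent of the sum(level = ...) form shown above :

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col3': ['a', 'a', 'b', 'b'], 'col4': [1, 2, 1, 2],
                    'val': np.arange(4.0)})
df2 = df1.set_index(['col3', 'col4'])       # col3 = outer level, col4 = inner level
print(df2.loc['a'])                         # partial indexing on the outer level
print(df2.groupby(level = 'col4').sum())    # what sum(level = 'col4') does under the hood
print(df2.reset_index().head())             # move the index levels back into columns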
Missing Data

Python NaN : np.nan (not a number). NaN or the Python built-in None mean missing/NA values in pandas.
* Use pd.isnull(), pd.notnull() or series1/df1.isnull() to detect missing data.

FILTERING OUT MISSING DATA
dropna() returns ONLY the non-null data; the source data is NOT modified.
df1.dropna()              # drop any row containing a missing value
df1.dropna(axis = 1)      # drop any column containing missing values
df1.dropna(how = 'all')   # drop rows that are all missing
df1.dropna(thresh = 3)    # drop any row containing < 3 observations

FILLING IN MISSING DATA
df2 = df1.fillna(0)               # fill all missing data with 0
df1.fillna(0, inplace = True)     # modify in place
Use a different fill value for each column : df1.fillna({'col1' : 0, 'col2' : -1})
Only forward-fill the first 2 missing values : df1.fillna(method = 'ffill', limit = 2)
    # i.e. for column1, if rows 3-6 are missing, rows 3 and 4 get filled with the value from row 2, NOT rows 5 and 6.
Essential Functionality

INDEXING (SLICING/SUBSETTING) †
Same as 'NdArray' : in INDEXING, a 'view' of the source array is returned.

† Note : the endpoint is inclusive in pandas slicing with labels (series1['a':'c']), whereas Python slicing is NOT. Pandas non-label (i.e. integer) slicing is still non-inclusive.

Index by Column(s) : df1['col1']
                     df1[['col1', 'col3']]
Index by Row(s) : df1.ix['row1']
                  df1.ix[['row1', 'row3']]
Index by Both Column(s) and Row(s) : df1.ix[['row2', 'row1'], 'col3']
Boolean Indexing : df1[[True, False]]
                   df1[df1['col2'] > 6]   # returns the rows whose col2 value is > 6
• Note that df1['col2'] > 6 returns a boolean Series; each True/False value determines whether the respective row appears in the result.
• Note : avoid integer indexing since it might introduce subtle bugs (e.g. series1[-1]). If you have to use position-based indexing, use "iget_value()" from Series and "irow/icol()" from DF instead of integer indexing.

DROPPING ROWS/COLUMNS
The drop operation returns a new object (i.e. DF) :
Remove Row(s) (axis = 0 is default) : df1.drop('row1')
                                      df1.drop(['row1', 'row3'])
Remove Column(s) : df1.drop('col2', axis = 1)

REINDEXING
Create a new object with the data rearranged to conform to a new index, introducing missing values if any index values were not already present.
Change df1 date index values to the new index values (reindex default is the row index) :
    date_index = pd.date_range('01/23/2010', periods = 10, freq = 'D')
    df1.reindex(date_index)
Replace Missing Values (NaN) with 0 : df1.reindex(date_index, fill_value = 0)
ReIndex Columns : df1.reindex(columns = ['a', 'b'])
ReIndex Both Rows and Columns : df1.reindex(index = [..], columns = [..])
Succinct ReIndex : df1.ix[[..], [..]]

ARITHMETIC AND DATA ALIGNMENT
• df1 + df2 : for indices that don't overlap, internal data alignment introduces NaN.
1. Instead of NaN, substitute 0 for the indices not found in the other DF : df1.add(df2, fill_value = 0)
2. Useful operation : df1 - df1.ix[0]   # subtract every row in df1 by the first row

SORTING AND RANKING
Sort Index/Column †
• sort_index() returns a new, sorted object. Default is "ascending = True".
• The row index is sorted by default; "axis = 1" is used for sorting columns.
† Sorting the index/column means sorting the row/column labels, not sorting the data.

Sort Data
* Missing values (np.nan) are sorted to the end of the Series by default.
Series Sorting : sortedS1 = series1.order()
                 series1.sort()   # in-place sort
DF Sorting : df1.sort_index(by = ['col2', 'col1'])   # sort by col2 first, then col1

Ranking
Rank ties are broken by assigning each tie-group the mean rank (e.g. two values tied for 5th place both get rank 5.5).
Output Rank of Each Element (rank starts from 1) : series1.rank()
                                                   df1.rank(axis = 1)   # rank each row's values

FUNCTION APPLICATIONS
NumPy works fine with pandas objects : np.abs(df1)
Applying a Function to Each Column or Row (default is to apply to each column : axis = 0) :
    f = lambda x: x.max() - x.min()   # return a scalar value
    def f(x): return Series([x.max(), x.min()])   # return multiple values
    df1.apply(f)
Applying a Function Element-Wise :
    f = lambda x: '%.2f' % x
    df1.applymap(f)   # format each entry to 2 decimals

UNIQUE, COUNTS
• It is NOT mandatory for index labels to be unique, although many functions require it. Check via : series1/df1.index.is_unique
• pd.value_counts() returns value frequencies.
Data Aggregation and Group Operations

Categorizing a data set and applying a function to each group, whether an aggregation or a transformation.

Note : for aggregation of "time series" data, please see the Time Series section. A special use case of "groupby" is used there, called "resampling".

GROUPBY (SPLIT-APPLY-COMBINE)  - similar to SQL groupby
Compute Group Mean : df1.groupby('col2').mean()
GroupBy More Than One Key : df1.groupby([df1['col2'], df1['col3']]).mean()
    # results in a hierarchical index consisting of the unique pairs of keys
"GroupBy" Object (ONLY computes intermediate data about the group key df1['col2']) :
    grouped = df1['col1'].groupby(df1['col2'])
    grouped.mean()   # gets the mean of each group formed by 'col2'
Indexing a "GroupBy" Object (select 'col1' for aggregation) :
    df1.groupby('col2')['col1']  or  df1['col1'].groupby(df1['col2'])
• Any missing values in the group key are excluded from the result.

1. Iterating over a GroupBy object
A "GroupBy" object supports iteration : it generates a sequence of 2-tuples containing the group name along with the chunk of data.
    for name, groupdata in df1.groupby('col2'):
        # name is a single value; groupdata is the filtered DF containing only the data matching that value
    for (k1, k2), groupdata in df1.groupby(['col2', 'col3']):
        # if grouping by multiple keys, the first element of the tuple is a tuple of key values
Convert Groups to Dict : dict(list(df1.groupby('col2')))   # col2's unique values become the keys of the dict
Group Columns by "dtype" : grouped = df1.groupby(df1.dtypes, axis = 1)
                           dict(list(grouped))   # separates the data into different types

2. Grouping with functions
Any function passed as a group key will be called once per (default is row index) value, with the return values being used as the group names. (This assumes the row index is named.)
    df1.groupby(len).sum()
    # returns a DF whose row index is the length of the names; names of the same length have their values summed. Column names are retained.

DATA AGGREGATION
Data aggregation means any data transformation that produces scalar values from arrays, such as "mean", "max", etc.
Use a Self-Defined Function : def func1(array): ...
                              grouped.agg(func1)
Get DF with Column Names as Function Names : grouped.agg([mean, std])
Get DF with Self-Defined Column Names : grouped.agg([('col1', mean), ('col2', std)])
Use a Different Function Depending on the Column : grouped.agg({'col1' : [min, max], 'col3' : sum})

GROUP-WISE OPERATIONS AND TRANSFORMATIONS
• agg() is a special case of data transformation, aka reducing a one-dimensional array to a scalar.
• transform() is a specialized data transformation : it applies a function to each group; if the function produces a scalar value, that value is placed in every row of the group. Thus, if the DF has 10 rows, after "transform()" there will still be 10 rows, each holding the scalar value from its respective group. The passed function must either produce a scalar value or a transformed array of the same size.
• General-purpose transformation : apply()
    df1.groupby('col2').apply(your_func1)
    # your function ONLY needs to return a pandas object or a scalar

# Example 1 : Yearly Correlations with SPX
# "close_price" is a DF with stock and SPX closing-price columns and a date index
returns = close_price.pct_change().dropna()
by_year = returns.groupby(lambda x : x.year)
spx_corr = lambda x : x.corrwith(x['SPX'])
by_year.apply(spx_corr)

# Example 2 : Exploratory Regression
import statsmodels.api as sm
def regress(data, y, x):
    Y = data[y]; X = data[x]
    X['intercept'] = 1
    result = sm.OLS(Y, X).fit()
    return result.params
by_year.apply(regress, 'AAPL', ['SPX'])
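To make the agg/transform/apply distinction concrete, a minimal sketch with made-up data :

import pandas as pd

df1 = pd.DataFrame({'col2': ['x', 'x', 'y', 'y'], 'col1': [1.0, 3.0, 2.0, 6.0]})
grouped = df1.groupby('col2')['col1']
print(grouped.agg(['mean', 'max']))            # one row per group (reduction)
print(grouped.transform('mean'))               # group mean broadcast back to all 4 rows
print(grouped.apply(lambda x: x / x.sum()))    # general-purpose : any pandas object or scalar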
Data Wrangling : Merge, Reshape, Clean, Transform

COMBINING AND MERGING DATA

1. pd.merge() aka database "join" : connects rows in DFs based on one or more keys.
• Merge via Column (Common)
df3 = pd.merge(df1, df2, on = 'col2') *
# INNER join is the default. Or use the option : how = 'outer/left/right'
# the indexes of df1 and df2 are discarded in df3
* ALL overlapping column names are used as the keys to merge. Good practice is to specify the keys : on = ['col2', 'col3'].
  If the key has a different name in df1 and df2, use the options : left_on = 'lkey', right_on = 'rkey'.
• Merge via Row (Uncommon)
df3 = pd.merge(df1, df2, left_index = True, right_index = True)
# Uses the indexes as the merge key : aka rows with the same index value are joined together.

2. pd.concat() : glues or stacks objects along an axis (default is along rows : axis = 0).
df3 = pd.concat([df1, df2], ignore_index = True)
# ignore_index = True : discard the original indexes in df3
# If df1 has 2 rows and df2 has 3 rows, then df3 has 5 rows.

3. combine_first() : combines data with overlap, patching missing values.
df3 = df1.combine_first(df2)
# df1 and df2 indexes overlap in full or in part. If a row does NOT exist in df1 but does in df2, it will be in df3. If row1 of df1 and row3 of df2 have the same index value, but row1's col3 value is NA, df3 gets this row with the col3 data from df2.
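A tiny end-to-end sketch of merge and concat (the DF contents are invented for illustration) :

import pandas as pd

df1 = pd.DataFrame({'col2': ['a', 'b'], 'left_val': [1, 2]})
df2 = pd.DataFrame({'col2': ['b', 'c'], 'right_val': [3, 4]})
print(pd.merge(df1, df2, on = 'col2'))                 # inner join keeps only 'b'
print(pd.merge(df1, df2, on = 'col2', how = 'outer'))  # outer join keeps 'a', 'b', 'c' with NaNs
print(pd.concat([df1, df2], ignore_index = True))      # stacks the rows with a fresh integer index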
RESHAPING AND PIVOTING

1. Reshaping with Hierarchical Indexing
series1 = df1.stack()
# Rotates (innermost level *) columns to rows as the innermost index level, resulting in a Series with a hierarchical index.
df1 = series1.unstack()
# Rotates (innermost level *) rows to columns as the innermost column level.
* Default is to stack/unstack the innermost level. If you want a different level, pass it, i.e. stack(level = 0) - the outermost level.
* Note : unstacking might introduce missing data if not all of the values in the level are found in each of the subgroups. Stacking filters out missing data by default, i.e. data.unstack().stack().

2. Pivoting
• A common format for storing multiple "time series" in databases and CSV is the long/stacked format : "date, stock_name, price".
• However, a DF with these 3 columns is difficult to work with. Thus, the "wide" format is preferred : 'date' as the row index, 'stock_name' as the columns, 'price' as the DF data values.
pivotedDf2 = df1.pivot('date', 'stock_name', 'price')
# Example pivotedDf2 :
#              AAPL   IBM    JD
# 2003-06-01  120.2  100.1  20.8
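A short sketch of the long-to-wide pivot and the stack/unstack round trip; the prices are made up, and keyword arguments are used so it runs on both older and newer pandas :

import pandas as pd

long_df = pd.DataFrame({'date': ['2003-06-01', '2003-06-01', '2003-06-02', '2003-06-02'],
                        'stock_name': ['AAPL', 'IBM', 'AAPL', 'IBM'],
                        'price': [120.2, 100.1, 121.0, 99.5]})
wide = long_df.pivot(index = 'date', columns = 'stock_name', values = 'price')
print(wide)                # one row per date, one column per stock
stacked = wide.stack()     # back to a Series with a (date, stock_name) hierarchical index
print(stacked.unstack())   # and back to the wide form again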
COMMON OPERATIONS

1. Removing Duplicate Rows
series1 = df1.duplicated()     # boolean Series indicating whether each row is a duplicate or not
df2 = df1.drop_duplicates()    # duplicates have been dropped in df2

2. Add New Column Based On Value of Column(s)
df1['newCol'] = df1['col2'].map(dict1)   # maps each col2 value as dict1's key and gets dict1's value
df1['newCol'] = df1['col2'].map(func1)   # applies a function to each col2 value

3. Replacing Values
# Replace is NOT in-place
df2 = df1.replace(np.nan, 100)
# Replace multiple values at once
df2 = df1.replace([-1, np.nan], 100)
df2 = df1.replace([-1, np.nan], [1, 2])
# The argument can be a dict as well
df2 = df1.replace({-1 : 1, np.nan : 2})

4. Renaming Axis Indexes
Convert Index to Upper Case : df1.index = df1.index.map(str.upper)
Rename 'row1' to 'newRow1' : df2 = df1.rename(index = {'row1' : 'newRow1'}, columns = str.upper)
# Optionally inplace = True

5. Discretization and Binning
• Continuous data is often discretized into "bins" for analysis.
# Divide data into 2 bins of numbers : (18, 26], (26, 35]
# ']' means inclusive, ')' is NOT inclusive
bins = [18, 26, 35]
cat = pd.cut(array1, bins, labels = [..])   # cat is a "Categorical" object
pd.value_counts(cat)
cat = pd.cut(array1, numofBins)    # compute equal-length bins based on the min and max values in array1
cat = pd.qcut(array1, numofBins)   # bin the data based on sample quantiles - roughly equal-size bins

6. Detecting and Filtering Outliers
• any() tests along an axis whether any element is "True". Default is to test along columns (axis = 0).
df1[(np.abs(df1) > 3).any(axis = 1)]
# Select all rows having a value > 3 or < -3.
# Another useful function : np.sign() returns 1 or -1.

7. Permutation and Random Sampling
randomOrder = np.random.permutation(df1.shape[0])
df2 = df1.take(randomOrder)

8. Computing Indicator/Dummy Variables
• If a column in a DF has "K" distinct values, derive an "indicator" DF containing K columns of 0s and 1s. 1 means the value exists, 0 means it does NOT.
dummyDf = pd.get_dummies(df1['col2'], prefix = 'col-')   # add a prefix to the K column names
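A minimal sketch of binning plus indicator variables (the ages are invented) :

import pandas as pd

ages = pd.Series([20, 27, 31, 25])
cat = pd.cut(ages, bins = [18, 26, 35])      # two bins : (18, 26] and (26, 35]
print(pd.value_counts(cat))                  # counts per bin
print(pd.get_dummies(cat, prefix = 'age'))   # one indicator column per bin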

Getting Data

TEXT FORMAT (CSV)
df1 = pd.read_csv(file/URL/file-like-object, sep = ',', header = None)
# Type inference : you do NOT have to specify which columns are numeric, integer, boolean or string.
# In pandas, missing data in the source data is usually an empty string, NA, -1, #IND or NULL. You can specify missing values via the option : na_values = ['NULL'].
# Default delimiter is a comma. Default is that the first row is the column header.
df1 = pd.read_csv(.., names = [..])
# Explicitly specify the column header; this also implies that the first row is data.
df1 = pd.read_csv(.., names = [.., 'date'], index_col = 'date')
# Make the 'date' column the row index of the returned DF.
df1.to_csv(filepath/sys.stdout, sep = ',')
# Missing values appear as empty strings in the output. Thus, it is better to add the option : na_rep = 'NULL'.
# Default is that row and column labels are written. Disable via the options : index = False, header = False.

JSON (JAVASCRIPT OBJECT NOTATION) DATA
One of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than tabular text like CSV.
Convert JSON string to Python form : data = json.loads(jsonObj)
Convert Python object to JSON : asJson = json.dumps(data)
Create DF from JSON : pd.DataFrame(data['name'], columns = ['field1'])

XML AND HTML DATA
HTML :
doc = lxml.html.parse(urlopen('http://..')).getroot()
tables = doc.findall('.//table')
rows = tables[1].findall('.//tr')
XML :
lxml.objectify.parse(open(filepath)).getroot()
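A self-contained read_csv/to_csv sketch using an in-memory file; io.StringIO stands in for a real path and the column names are invented :

import io
import pandas as pd

csv_text = "date,state,value\n2010-01-23,Ohio,1.5\n2010-01-24,CA,NULL\n"
df1 = pd.read_csv(io.StringIO(csv_text), na_values = ['NULL'],
                  parse_dates = True, index_col = 'date')
print(df1)                       # 'NULL' became NaN, 'date' became a DatetimeIndex
out = io.StringIO()
df1.to_csv(out, na_rep = 'NULL')
print(out.getvalue())            # missing values written back out as 'NULL'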
Descriptive Statistics Methods †
Compared with the equivalent methods of ndArray, descriptive statistics methods in pandas are built from the ground up to exclude missing data.
† NA (i.e. NaN) values are excluded. This can be disabled using the option "skipna = False".

Column Sums (use axis = 1 to sum over rows) : series1 = df1.sum()
Return the Index Labels Where the Min/Max Values are Attained : df1.idxmin() or df1.idxmax()
Multiple Summary Statistics (i.e. count, mean, std) : df1.describe()
# On non-numeric data, describe() gives alternate statistics (i.e. count, unique).

CORRELATION AND COVARIANCE
• cov(), corr()
• corrwith() - pairwise correlations : aka compute a DF with a Series. If the input is not a Series but another DF, it computes the correlations of matching column names, i.e. returns.corrwith(volumes).

# Example : Correlation
import pandas_datareader.data as web
data = {}
for ticker in ['AAPL', 'JD']:
    data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')
prices = pd.DataFrame({ticker : d['Adj Close'] for ticker, d in data.items()})
volumes = ...
returns = prices.pct_change()
returns.AAPL.corr(returns.JD)
# Series corr() computes the correlation of overlapping, non-NA, aligned-by-index values in two Series.
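A tiny offline sketch of these methods (the values are invented), so it runs without a data feed :

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'AAPL': [1.0, 2.0, np.nan, 4.0],
                    'IBM':  [2.0, 1.0, 3.0, 5.0]})
print(df1.sum())                        # column sums; NaN is skipped by default
print(df1.idxmax())                     # index label where each column's max occurs
print(df1.describe())                   # count / mean / std / quartiles per column
print(df1['AAPL'].corr(df1['IBM']))     # correlation over overlapping, non-NA rows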
Time Series

• Python standard library data types for date and time : "datetime", "time", "calendar". †
• Pandas data type for date and time : "Timestamp". *
† "datetime" is widely used; it stores both the date and the time down to the microsecond.
* A "Timestamp" object can be substituted anywhere you would use a "datetime" object.

Convert String to DateTime :
    from datetime import datetime
    datetime.strptime('8/8/2008', '%m/%d/%Y')
Get Time Now : now = datetime.now()
DateTime Arithmetic :
    from datetime import timedelta
    datetime(2011, 1, 8) + timedelta(12)  => 2011-01-20
    # A timedelta represents the temporal difference between two datetime objects.
Convert Strings to Pandas Timestamps : timestamps = pd.to_datetime(['8/8/2008', ..])
# NaT (Not a Time) is the pandas NA value for Timestamp data :
#   pd.to_datetime('') => NaT    # missing value (i.e. empty string)
#   pd.isnull(NaT) => True

PANDA TIME SERIES

Create Time Series :
    ts1 = pd.Series(np.random.randn(8), index = [ datetime(2011, 1, 2), .. ])
    ts1 = pd.Series(..., index = pd.date_range('1/1/2000', periods = 1000))
    # ts1.index is the pandas "DatetimeIndex" class
† The index value ts1.index[0] is a pandas "Timestamp" object, which stores the timestamp using NumPy's "datetime64" type at nanosecond resolution. Further, the Timestamp class stores the frequency information as well as the timezone.
    ts1.index.dtype  => datetime64[ns]

Indexing (Slicing/Subsetting)
The argument can be a string date, a datetime or a Timestamp.
Select Year 2001 : ts1['2001']
                   df1.ix['2001']
Select June 2001 : ts1['2001-06']
Select From 2001-01-01 to 2001-08-01 : ts1['1/1/2001':'8/1/2001']
Select From 2001-01-08 On : ts1[datetime(2001, 1, 8):]

Common Operations
Get Time Series Data Before 2011-01-09 : ts1.truncate(after = '1/8/2011')
DATE RANGES, FREQUENCIES AND SHIFTING

† Generic time series in pandas are assumed to be irregular, aka to have no fixed frequency. However, we often prefer to work with a fixed frequency, i.e. daily, monthly, etc.
Take a look at the "Resampling" section :
    ts1.resample('D', how = ..)
    # Convert to a fixed daily frequency. Introduces missing values (NaN) if needed.

1. Frequencies and Date Offsets
• Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly.
    freq = '4H'
    freq = '1h30min'
    # Standard US equity option monthly expiration, every third Friday of the month : freq = 'WOM-3FRI'

2. Generating Date Ranges
Default Frequency is Daily : pd.date_range(begin, end)  or  pd.date_range(begin or end, periods = n)
# The option freq = 'BM' means the last business day at the end of the month.

3. Shifting (Leading and Lagging) Data
• Shifting refers to moving data backward and forward through time.
• Series and DF "shift()" does a naive shift, aka the index does not shift, only the values. *
    ts1.shift(1)   # ts1 is daily data : move yesterday's value to today, today's value to tomorrow, etc.
Shift Data By 3 Days (ts1 is any time series data) :
    ts1.shift(3, freq = 'D')  or  ts1.shift(1, freq = '3D')
# Common use of shift : to compute the % change
    ts1 / ts1.shift(1) - 1
* In the result returned from shift(), some data values might be NaN.
• Other ways to shift data :
    from pandas.tseries.offsets import Day, MonthEnd
    datetime(2008, 8, 8) + 3*Day()      => 2008-08-11
    datetime(2008, 8, 8) + MonthEnd(2)  => 2008-09-30
    MonthEnd().rollforward(datetime(2008, 8, 8))  => 2008-08-31
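A small sketch of a naive shift versus an index shift, plus the shift-based % change (the values are invented) :

import pandas as pd

ts1 = pd.Series([1.0, 2.0, 3.0, 4.0], index = pd.date_range('1/1/2000', periods = 4))
print(ts1 / ts1.shift(1) - 1)       # day-over-day % change; the first value is NaN
print(ts1.shift(1, freq = '3D'))    # the index moves 3 days forward, values stay attached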
TIME ZONE HANDLING
• Daylight saving time (DST) transitions are a common source of complication.
• UTC is the current international standard. Time zones are expressed as offsets from UTC. * NY is 4 hours behind UTC during daylight saving time and 5 hours the rest of the year.

1. Python Time Zones (from the 3rd-party pytz library)
Get List of Timezone Names : pytz.common_timezones
Get a Timezone Object : pytz.timezone('US/Eastern')

2. Localization and Conversion
Time Series is Time Zone Naive by Default : ts1.index.tz  => None
Specify a Time Zone When Creating a Time Series : use the option tz = 'UTC' in pd.date_range()
Localization From Naive : ts1_utc = ts1.tz_localize('UTC')
Convert to Another Time Zone Once the Time Series Has Been Localized : ts1_eastern = ts1_utc.tz_convert('US/Eastern')

3. ** Time Zone-aware Timestamp Objects
stamp_utc = pd.Timestamp('2008-08-08 03:00', tz = 'UTC')
stamp_eastern = stamp_utc.tz_convert(...)
Pandas time arithmetic respects daylight saving time transitions :
    stamp = pd.Timestamp('2012-11-04 00:30', tz = 'US/Eastern')  => 2012-11-04 00:30:00-0400 EDT
    stamp + 2 * Hour()  => 2012-11-04 01:30:00-0500 EST
** If two time series with different time zones are combined, i.e. ts1 + ts2, the timestamps will auto-align with respect to time zone. The result will be in UTC.

RESAMPLING
The process of converting a time series from one frequency to another :

1. Downsampling - aggregating higher-frequency data to a lower frequency.
ts1.resample('M', how = 'mean')
    => Index : 2000-01-31, 2000-02-29, ...
ts1.resample('M', ..., kind = 'period')   # 'period' - use a time-span representation
    => Index : 2000-01, 2000-02, ...
# ts1 is one-minute data with values 1 to 100 at times 00:00:00, 00:01:00, ...
ts1.resample('5min', how = 'sum')  =>
    00:00:00    15   (aka : 1 + 2 + 3 + 4 + 5)
    00:05:00    40
# Default is that the left bin edge is inclusive, thus the 00:00:00 value is included in the 00:00:00 to 00:05:00 interval.
# The option closed = 'right' changes the interval to right-inclusive. Also include the option label = 'right' as well :
    00:00:00     1
    00:05:00    20   (aka : 2 + 3 + 4 + 5 + 6)
ts1.resample('5min', how = 'ohlc')   # returns a DF with 4 columns - open, high, low, close
* Alternate way to downsample : ts1.groupby(lambda x : x.month).mean()

2. Upsampling and Interpolation * - interpolate a low frequency to a higher frequency. By default missing values (NaN) are introduced.
df1.resample('D', fill_method = 'ffill')
# Forward fills values : i.e. a missing-value index such as index 3 will copy the value from index 2.
* Interpolation will ONLY apply row-wise.
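A runnable downsampling sketch; the data is a short made-up version of the example above, and the .sum()/.ohlc() method chain is the newer spelling of the how = .. keyword used in this sheet :

import numpy as np
import pandas as pd

ts1 = pd.Series(np.arange(1, 11),
                index = pd.date_range('1/1/2000 00:00', periods = 10, freq = 'min'))
print(ts1.resample('5min').sum())                                     # 00:00 -> 15, 00:05 -> 40
print(ts1.resample('5min', closed = 'right', label = 'right').sum())  # right-inclusive bins, right labels
print(ts1.resample('5min').ohlc())                                    # open / high / low / close columns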
TIME SERIES PLOTTING

# Example : the source data's first column is the date. Use the first column as the index, then parse the index values as dates.
prices = pd.read_csv(.., parse_dates = True, index_col = 0)
px = prices[['AAPL', 'IBM']]
px = px.resample('B', fill_method = 'ffill')
px['AAPL'].plot()
px['AAPL'].ix['01-2008':'03-2012'].plot()
px.ix['2008'].plot()

MOVING WINDOW FUNCTIONS
Like other statistical functions, these functions automatically exclude missing data.
pd.rolling_mean(px.AAPL, 200).plot()
pd.rolling_std(px.AAPL.pct_change(), 22, min_periods = 20).plot()
pd.rolling_corr(px.AAPL.pct_change(), px.IBM.pct_change(), 22).plot()
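A self-generated moving-window sketch, using the newer .rolling() spelling of the pd.rolling_* helpers above (the price path is random) :

import numpy as np
import pandas as pd

px = pd.Series(np.random.randn(300).cumsum() + 100,
               index = pd.date_range('1/1/2000', periods = 300, freq = 'B'))
ma = px.rolling(50).mean()                                   # like pd.rolling_mean(px, 50)
vol = px.pct_change().rolling(22, min_periods = 20).std()    # like pd.rolling_std(.., 22, ..)
print(ma.tail())
print(vol.tail())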
PERFORMANCE
• Since "Timestamps" are represented as 64-bit integers using NumPy's datetime64 type, each data point has an associated 8 bytes of memory per timestamp.
• "Creating views" on an existing time series or DF does not cause any more memory to be used.
• Indexes for lower frequencies (daily and up) are stored in a central cache, so any fixed-frequency index is a view on the date cache. Thus, the memory footprint of low-frequency indexes is not significant.
• Performance-wise, pandas has been highly optimized for data alignment operations (i.e. ts1 + ts2) and resampling.

Created by Arianne Colton and Sean Chen
www.datasciencefree.com
Based on content from "Python for Data Analysis" by Wes McKinney
Updated: August 22, 2016
