
Data Wrangling with pandas Cheat Sheet
http://pandas.pydata.org

Pandas API Reference, Pandas User Guide

Tidy Data – A foundation for wrangling in pandas

In a tidy data set:
Each variable is saved in its own column.
Each observation is saved in its own row.

Tidy data complements pandas’s vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas.
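
As a minimal sketch of the tidy layout described above, the following reshapes a small wide table so that each variable has its own column and each observation its own row. The column names (country, 1999, 2000, cases) and the values are illustrative assumptions, not part of the cheat sheet.

```python
import pandas as pd

# Hypothetical wide table: the variable "year" is hidden in the
# column headers, so the table is not tidy.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1999": [10, 20],
    "2000": [30, 40],
})

# Gather the year columns into rows: one row per (country, year)
# observation, one column per variable.
tidy = pd.melt(wide, id_vars="country", var_name="year", value_name="cases")

# A vectorized operation now applies to a whole variable at once
# and keeps every observation on its own row.
tidy["cases_doubled"] = tidy["cases"] * 2
print(tidy)
```

Once the data is in this shape, filtering, grouping and plotting can all refer to year and cases by column name.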
Creating DataFrames

     a   b   c
1    4   7   10
2    5   8   11
3    6   9   12

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=[1, 2, 3])
Specify values for each column.

df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
Specify values for each row.

n   v    a   b   c
d   1    4   7   10
    2    5   8   11
e   2    6   9   12

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)],
        names=['n', 'v']))
Create DataFrame with a MultiIndex.

Reshaping Data – Change layout, sorting, reindexing, renaming

pd.melt(df)
Gather columns into rows.

df.pivot(columns='var', values='val')
Spread rows into columns.

pd.concat([df1, df2])
Append rows of DataFrames.

pd.concat([df1, df2], axis=1)
Append columns of DataFrames.

df.sort_values('mpg')
Order rows by values of a column (low to high).

df.sort_values('mpg', ascending=False)
Order rows by values of a column (high to low).

df.rename(columns={'y': 'year'})
Rename the columns of a DataFrame.

df.sort_index()
Sort the index of a DataFrame.

df.reset_index()
Reset index of DataFrame to row numbers, moving index to columns.

df.drop(columns=['Length', 'Height'])
Drop columns from DataFrame.

Subset Observations - rows

df[df.Length > 7]
Extract rows that meet logical criteria.

df.drop_duplicates()
Remove duplicate rows (only considers columns).

df.sample(frac=0.5)
Randomly select fraction of rows.

df.sample(n=10)
Randomly select n rows.

df.nlargest(n, 'value')
Select and order top n entries.

df.nsmallest(n, 'value')
Select and order bottom n entries.

df.head(n)
Select first n rows.

df.tail(n)
Select last n rows.

Subset Variables - columns

df[['width', 'length', 'species']]
Select multiple columns with specific names.

df['width'] or df.width
Select single column with specific name.

df.filter(regex='regex')
Select columns whose name matches regular expression regex.

Subsets - rows and columns

Use df.loc[] and df.iloc[] to select only rows, only columns or both.
Use df.at[] and df.iat[] to access a single value by row and column.
First index selects rows, second index columns.

df.iloc[10:20]
Select rows in positions 10 through 19 (the end position is excluded).

df.iloc[:, [1, 2, 5]]
Select columns in positions 1, 2 and 5 (first column is 0).

df.loc[:, 'x2':'x4']
Select all columns between x2 and x4 (inclusive).

df.loc[df['a'] > 10, ['a', 'c']]
Select rows meeting logical condition, and only the specific columns.

df.iat[1, 2]
Access single value by integer position.

df.at[4, 'A']
Access single value by label.

Using query

query() allows Boolean expressions for filtering rows.

df.query('Length > 7')
df.query('Length > 7 and Width < 8')
df.query('Name.str.startswith("abc")', engine="python")

Method Chaining

Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code.

df = (pd.melt(df)
        .rename(columns={
            'variable': 'var',
            'value': 'val'})
        .query('val >= 200')
     )

Logic in Python (and pandas)

<                                 Less than
>                                 Greater than
==                                Equals
<=                                Less than or equals
>=                                Greater than or equals
!=                                Not equal to
df.column.isin(values)            Group membership
pd.isnull(obj)                    Is NaN
pd.notnull(obj)                   Is not NaN
&, |, ~, ^, df.any(), df.all()    Logical and, or, not, xor, any, all

regex (Regular Expressions) Examples

'\.'                 Matches strings containing a period '.'
'Length$'            Matches strings ending with word 'Length'
'^Sepal'             Matches strings beginning with the word 'Sepal'
'^x[1-5]$'           Matches strings beginning with 'x' and ending with 1,2,3,4,5
'^(?!Species$).*'    Matches strings except the string 'Species'

Cheatsheet for pandas (http://pandas.pydata.org/) originally written by Irv Lustig, Princeton Consultants, inspired by Rstudio Data Wrangling Cheatsheet
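
The subsetting, query and method-chaining entries on this page can be tried together on the three-row example frame from Creating DataFrames. This is only a sketch: the thresholds (a > 4, val >= 10) are arbitrary choices made so that the tiny frame returns something.

```python
import pandas as pd

# The example frame from "Creating DataFrames".
df = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=[1, 2, 3],
)

# A Boolean mask and query() select the same rows.
by_mask = df[df.a > 4]
by_query = df.query("a > 4")

# loc is label-based (the end label is included);
# iloc is position-based (the end position is excluded).
by_label = df.loc[1:2, ["a", "c"]]
by_position = df.iloc[0:2, [0, 2]]

# at / iat read a single value by label / by integer position.
value_by_label = df.at[2, "b"]
value_by_position = df.iat[1, 1]

# Method chaining: melt, rename and filter in one expression.
long_df = (
    pd.melt(df)
    .rename(columns={"variable": "var", "value": "val"})
    .query("val >= 10")
)
print(long_df)
```

Here df.loc[1:2] and df.iloc[0:2] return the same two rows only because the index labels start at 1; with a different index the two selections would differ.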
Summarize Data

df['w'].value_counts()
Count number of rows with each unique value of variable.

len(df)
# of rows in DataFrame.

df.shape
Tuple of # of rows, # of columns in DataFrame.

df['w'].nunique()
# of distinct values in a column.

df.describe()
Basic descriptive statistics for each column (or GroupBy).

pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling (see below)) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:

sum()                      Sum values of each object.
count()                    Count non-NA/null values of each object.
median()                   Median value of each object.
quantile([0.25, 0.75])     Quantiles of each object.
apply(function)            Apply function to each object.
min()                      Minimum value in each object.
max()                      Maximum value in each object.
mean()                     Mean value of each object.
var()                      Variance of each object.
std()                      Standard deviation of each object.

Group Data

df.groupby(by="col")
Return a GroupBy object, grouped by values in column named "col".

df.groupby(level="ind")
Return a GroupBy object, grouped by values in index level named "ind".

All of the summary functions listed above can be applied to a group. Additional GroupBy functions:

size()           Size of each group.
agg(function)    Aggregate group using function.

Windows

df.expanding()
Return an Expanding object allowing summary functions to be applied cumulatively.

df.rolling(n)
Return a Rolling object allowing summary functions to be applied to windows of length n.

Handling Missing Data

df.dropna()
Drop rows with any column having NA/null data.

df.fillna(value)
Replace all NA/null data with value.

Make New Columns

df.assign(Area=lambda df: df.Length*df.Height)
Compute and append one or more new columns.

df['Volume'] = df.Length*df.Height*df.Depth
Add single column.

pd.qcut(df.col, n, labels=False)
Bin column into n buckets.

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:

max(axis=1)                   Element-wise max.
min(axis=1)                   Element-wise min.
clip(lower=-10, upper=10)     Trim values at input thresholds.
abs()                         Absolute value.

The examples below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.

shift(1)                 Copy with values shifted down by 1 (each row gets the previous row's value).
shift(-1)                Copy with values shifted up by 1 (each row gets the next row's value).
rank(method='dense')     Ranks with no gaps.
rank(method='min')       Ranks. Ties get min rank.
rank(pct=True)           Ranks rescaled to interval [0, 1].
rank(method='first')     Ranks. Ties go to first value.
cumsum()                 Cumulative sum.
cummax()                 Cumulative max.
cummin()                 Cumulative min.
cumprod()                Cumulative product.

Plotting

df.plot.hist()
Histogram for each column.

df.plot.scatter(x='w', y='h')
Scatter chart using pairs of points.

Combine Data Sets

adf            bdf
x1   x2        x1   x3
A    1         A    T
B    2         B    F
C    3         D    T

Standard Joins

pd.merge(adf, bdf, how='left', on='x1')
Join matching rows from bdf to adf.
x1   x2    x3
A    1     T
B    2     F
C    3     NaN

pd.merge(adf, bdf, how='right', on='x1')
Join matching rows from adf to bdf.
x1   x2    x3
A    1.0   T
B    2.0   F
D    NaN   T

pd.merge(adf, bdf, how='inner', on='x1')
Join data. Retain only rows in both sets.
x1   x2    x3
A    1     T
B    2     F

pd.merge(adf, bdf, how='outer', on='x1')
Join data. Retain all values, all rows.
x1   x2    x3
A    1     T
B    2     F
C    3     NaN
D    NaN   T

Filtering Joins

adf[adf.x1.isin(bdf.x1)]
All rows in adf that have a match in bdf.
x1   x2
A    1
B    2

adf[~adf.x1.isin(bdf.x1)]
All rows in adf that do not have a match in bdf.
x1   x2
C    3

ydf            zdf
x1   x2        x1   x2
A    1         B    2
B    2         C    3
C    3         D    4

Set-like Operations

pd.merge(ydf, zdf)
Rows that appear in both ydf and zdf (Intersection).
x1   x2
B    2
C    3

pd.merge(ydf, zdf, how='outer')
Rows that appear in either or both ydf and zdf (Union).
x1   x2
A    1
B    2
C    3
D    4

pd.merge(ydf, zdf, how='outer', indicator=True)
  .query('_merge == "left_only"')
  .drop(columns=['_merge'])
Rows that appear in ydf but not zdf (Setdiff).
x1   x2
A    1
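
The join and set-like entries under Combine Data Sets can be reproduced with the small adf, bdf, ydf and zdf tables shown on this page. The sketch below rebuilds those tables and runs a selection of the merges; the result variable names (left, semi, setdiff and so on) are chosen here for illustration.

```python
import pandas as pd

# The example tables from the cheat sheet.
adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"], "x3": ["T", "F", "T"]})

# Standard joins add columns from bdf to adf.
left = pd.merge(adf, bdf, how="left", on="x1")
inner = pd.merge(adf, bdf, how="inner", on="x1")
outer = pd.merge(adf, bdf, how="outer", on="x1")

# Filtering joins keep or drop rows of adf without adding columns.
semi = adf[adf.x1.isin(bdf.x1)]
anti = adf[~adf.x1.isin(bdf.x1)]

ydf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
zdf = pd.DataFrame({"x1": ["B", "C", "D"], "x2": [2, 3, 4]})

# With no "on" argument, merge joins on all shared columns,
# which makes it behave like a set operation on whole rows.
intersection = pd.merge(ydf, zdf)
union = pd.merge(ydf, zdf, how="outer")

# indicator=True adds a _merge column recording each row's origin.
setdiff = (
    pd.merge(ydf, zdf, how="outer", indicator=True)
    .query('_merge == "left_only"')
    .drop(columns=["_merge"])
)
print(setdiff)
```

Rows without a partner get NaN in the columns that come from the other table, which is why x2 shows up as a float column in the right join.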

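The Summarize Data, Group Data and Windows entries can be sketched together as follows. The frame and its column names (col, val) are hypothetical and exist only to show which functions return one value per group and which return a vector aligned with the original rows.

```python
import pandas as pd

# Hypothetical data: two groups, "x" and "y".
df = pd.DataFrame({
    "col": ["x", "x", "y", "y", "y"],
    "val": [1, 2, 3, 4, 5],
})

grouped = df.groupby(by="col")

# Summary functions produce a single value for each group.
sizes = grouped.size()
sums = grouped["val"].sum()
stats = grouped["val"].agg(["min", "max", "mean"])

# Vector functions are applied per group, and the result has the
# length of the original DataFrame.
df["running_total"] = grouped["val"].cumsum()
df["previous"] = grouped["val"].shift(1)
df["rank_in_group"] = grouped["val"].rank(method="dense")

# Windows: expanding() is cumulative, rolling(n) uses the last n rows.
expanding_mean = df["val"].expanding().mean()
rolling_sum = df["val"].rolling(2).sum()
print(df)
```

The first row of each group has no previous value, so shift(1) leaves NaN there; the first rolling(2) window is incomplete for the same reason.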