In [2]:
import numpy as np
import pandas as pd
Object creation
Creating a Series by passing a list of values, letting pandas create a default integer index:
In [3]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Out[3]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
In [4]:
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
Out[4]:
                   A         B         C         D
2013-01-01 -0.420668  0.316927  1.194467 -0.180756
2013-01-02  0.795189 -0.012337  1.629099  2.342658
2013-01-03 -1.976610 -0.321517  0.548684  0.016748
2013-01-04  1.119085 -1.300419 -0.619006  1.949738
2013-01-05 -0.351920 -2.822457  1.027199  0.879863
2013-01-06  0.339076 -2.810959 -0.367116 -0.950692
Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:
In [5]:
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
Out[5]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
The columns of the resulting DataFrame have different dtypes:
In [6]:
df2.dtypes
Out[6]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Viewing data
Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame, respectively:
In [7]:
print(df.head())
print(df.tail())
A B C D
2013-01-01 -0.420668 0.316927 1.194467 -0.180756
2013-01-02 0.795189 -0.012337 1.629099 2.342658
2013-01-03 -1.976610 -0.321517 0.548684 0.016748
2013-01-04 1.119085 -1.300419 -0.619006 1.949738
2013-01-05 -0.351920 -2.822457 1.027199 0.879863
A B C D
2013-01-02 0.795189 -0.012337 1.629099 2.342658
2013-01-03 -1.976610 -0.321517 0.548684 0.016748
2013-01-04 1.119085 -1.300419 -0.619006 1.949738
2013-01-05 -0.351920 -2.822457 1.027199 0.879863
2013-01-06 0.339076 -2.810959 -0.367116 -0.950692
Display the index (DataFrame.columns works the same way for the column labels):
In [8]:
df.index
Out[8]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
Note that NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
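This dtype-widening behavior can be seen with a tiny, hypothetical frame (a sketch, separate from the tutorial's df):

```python
import numpy as np
import pandas as pd

# All-float columns convert to a cheap float64 array.
df_num = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# Adding a string column forces the widest common dtype: object.
df_mix = df_num.assign(label=["a", "b"])

print(df_num.to_numpy().dtype)  # float64
print(df_mix.to_numpy().dtype)  # object
```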
In [9]:
df.to_numpy()
Out[9]:
array([[-0.42066784, 0.3169275 , 1.19446662, -0.18075587],
[ 0.79518934, -0.01233682, 1.62909892, 2.34265771],
[-1.97661037, -0.32151732, 0.54868407, 0.01674812],
[ 1.11908538, -1.30041906, -0.61900632, 1.94973826],
[-0.35191989, -2.82245689, 1.02719947, 0.87986305],
[ 0.33907597, -2.81095942, -0.36711631, -0.95069173]])
In [10]:
print(type(df.to_numpy()))
<class 'numpy.ndarray'>
In [11]:
df.dtypes
Out[11]:
A float64
B float64
C float64
D float64
dtype: object
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive:
In [12]:
df2.to_numpy()
Out[12]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
DataFrame.to_numpy() does not include the index or column labels in the output.
describe() shows a quick statistic summary of your data:
In [13]:
df.describe()
Out[13]:
A B C D
Transposing your data:
In [14]:
df.T
Out[14]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
Sorting by an axis:
In [15]:
df.sort_index(axis=1, ascending=False)
Out[15]:
D C B A
Sorting by values:
In [16]:
df.sort_values(by="B")
Out[16]:
A B C D
Selection
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we
recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.
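As a small illustration of the split (using a toy frame, not the tutorial's df): .loc and .at select by label, while .iloc and .iat select by integer position:

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# Label-based: .loc returns a row/column cross section, .at a single scalar.
assert toy.at["y", "B"] == 5

# Position-based: .iloc/.iat mirror the same split with integer offsets.
assert toy.iat[1, 1] == 5
```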
Selecting a single column, which yields a Series, equivalent to df.A:
In [17]:
df["A"]
Out[17]:
2013-01-01 -0.420668
2013-01-02 0.795189
2013-01-03 -1.976610
2013-01-04 1.119085
2013-01-05 -0.351920
2013-01-06 0.339076
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows:
In [18]:
df[0:3]
df["20130102":"20130104"]
Out[18]:
A B C D
Selection by label
For getting a cross section using a label:
In [19]:
df.loc[dates[0]]
Out[19]:
A -0.420668
B 0.316927
C 1.194467
D -0.180756
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
In [20]:
df.loc[:, ["A", "B"]]
Out[20]:
A B
Showing label slicing, both endpoints are included:
In [21]:
df.loc["20130102":"20130104", ["A", "B"]]
Out[21]:
A B
Reduction in the dimensions of the returned object:
In [22]:
df.loc["20130102", ["A", "B"]]
Out[22]:
A 0.795189
B -0.012337
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value:
In [23]:
df.loc[dates[0], "A"]
Out[23]:
-0.4206678363713324
For getting fast access to a scalar (equivalent to the prior method):
In [24]:
df.at[dates[0], "A"]
Out[24]:
-0.4206678363713324
Selection by position
Select via the position of the passed integers:
In [25]:
df.iloc[3]
Out[25]:
A 1.119085
B -1.300419
C -0.619006
D 1.949738
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similar to NumPy/Python:
In [26]:
df.iloc[3:5, 0:2]
Out[26]:
A B
By lists of integer position locations, similar to the NumPy/Python style:
In [27]:
df.iloc[[1, 2, 4], [0, 2]]
Out[27]:
A C
For slicing rows explicitly:
In [28]:
df.iloc[1:3, :]
Out[28]:
A B C D
For slicing columns explicitly:
In [29]:
df.iloc[:, 1:3]
Out[29]:
B C
For getting a value explicitly:
In [30]:
df.iloc[1, 1]
Out[30]:
-0.012336819546293868
For getting fast access to a scalar (equivalent to the prior method):
In [31]:
df.iat[1, 1]
Out[31]:
-0.012336819546293868
Boolean indexing
Using a single column's values to select data:
In [32]:
df[df["A"] > 0]
Out[32]:
A B C D
Selecting values from a DataFrame where a boolean condition is met:
In [33]:
df[df > 0]
Out[33]:
A B C D
Using the isin() method for filtering:
In [34]:
df2 = df.copy()
In [35]:
df2["E"] = ["one", "one", "two", "three", "four", "three"]
In [36]:
df2
Out[36]:
A B C D E
In [37]:
df2[df2["E"].isin(["two", "four"])]
Out[37]:
A B C D E
Setting
Setting a new column automatically aligns the data by the indexes:
In [38]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
In [39]:
s1
Out[39]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
Setting values by label, by position, and by assigning with a NumPy array:
In [40]:
df["F"] = s1
df.at[dates[0], "A"] = 0
df.iat[0, 1] = 0
df.loc[:, "D"] = np.array([5] * len(df))
The result of the prior setting operations:
In [44]:
df
Out[44]:
A B C D F
A where operation with setting:
In [45]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2
Out[45]:
A B C D F
Missing data
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
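A minimal sketch of the default NaN handling (toy data, not the tutorial's frame):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Reductions skip NaN by default...
print(s.sum())   # 4.0
print(s.mean())  # 2.0

# ...unless skipping is disabled explicitly.
print(s.sum(skipna=False))  # nan
```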
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
In [46]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1
df1
Out[46]:
A B C D F E
To drop any rows that have missing data:
In [47]:
df1.dropna(how="any")
Out[47]:
A B C D F E
Filling missing data:
In [48]:
df1.fillna(value=5)
Out[48]:
A B C D F E
To get the boolean mask where values are nan:
In [49]:
pd.isna(df1)
Out[49]:
A B C D F E
Statistics
Operations in general exclude missing data. Performing a descriptive statistic:
In [50]:
df.mean()
Out[50]:
A -0.012530
B -1.211282
C 0.568888
D 5.000000
F 3.000000
dtype: float64
Same operation on the other axis:
In [51]:
df.mean(1)
Out[51]:
2013-01-01 1.548617
2013-01-02 1.682390
2013-01-03 1.050111
2013-01-04 1.439932
2013-01-05 1.370565
2013-01-06 1.432200
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension:
In [52]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s
Out[52]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
In [54]:
df.sub(s, axis="index")
Out[54]:
A B C D F
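The alignment-then-broadcast behavior can be sketched with a small deterministic example (names here are illustrative, not from the tutorial):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=3)
frame = pd.DataFrame({"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]}, index=idx)
ser = pd.Series([1.0, 2.0, np.nan], index=idx)

# axis="index" broadcasts the Series down each column; missing entries
# in the Series propagate NaN into the result.
out = frame.sub(ser, axis="index")
print(out["A"].tolist())  # [9.0, 18.0, nan]
```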
Apply
Applying functions to the data:
In [55]:
df.apply(np.cumsum)
Out[55]:
A B C D F
In [56]:
df.apply(lambda x: x.max() - x.min())
Out[56]:
A 3.095696
B 2.822457
C 2.248105
D 0.000000
F 4.000000
dtype: float64
Histogramming
In [57]:
s = pd.Series(np.random.randint(0, 7, size=10))
s
Out[57]:
0 3
1 4
2 2
3 5
4 5
5 0
6 1
7 4
8 6
9 0
dtype: int32
In [58]:
s.value_counts()
Out[58]:
4 2
5 2
0 2
3 1
2 1
1 1
6 1
dtype: int64
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code
snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at
Vectorized String Methods.
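A short sketch of the regex default (toy data; na= and regex= shown are real parameters of these methods):

```python
import pandas as pd

names = pd.Series(["apple pie", "banana split", None])

# contains() interprets the pattern as a regex by default; na= controls
# what missing values become in the boolean mask.
mask = names.str.contains(r"^a", na=False)
print(mask.tolist())  # [True, False, False]

# replace() applies a regex substitution element-wise when regex=True.
snaked = names.str.replace(r"\s+", "_", regex=True)
print(snaked[0])  # apple_pie
```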
In [59]:
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()
Out[59]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
Merge
Concat - pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and
relational algebra functionality in the case of join / merge-type operations.
In [60]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[60]:
0 1 2 3
Break it into pieces and concatenate them back together with concat():
In [61]:
pieces = [df[:3], df[3:7], df[7:]]
In [62]:
pd.concat(pieces)
Out[62]:
0 1 2 3
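The round trip above can be sketched with a tiny deterministic frame (illustrative, not the tutorial's df):

```python
import pandas as pd

frame = pd.DataFrame({"A": range(6)})
parts = [frame[:2], frame[2:4], frame[4:]]

# concat() stitches the pieces back together, preserving the index,
# so the result is identical to the original frame.
restored = pd.concat(parts)
print(restored.equals(frame))  # True
```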
Join
merge() enables SQL style joins along specific columns. See the Database style joining section.
In [63]:
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
print(left)
print(right)
key lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 foo 5
pd.merge(left, right, on="key")
Out[63]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
Another example that can be given is:
In [64]:
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
print(left)
print(right)
key lval
0 foo 1
1 bar 2
key rval
0 foo 4
1 bar 5
pd.merge(left, right, on="key")
Out[64]:
key lval rval
0 foo 1 4
1 bar 2 5
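The set logic of the join can be sketched with keys that only partially overlap (a hypothetical pair of frames, assuming merge()'s `how` parameter):

```python
import pandas as pd

l = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
r = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# The default inner join keeps only keys present on both sides;
# how="outer" keeps the union, filling the gaps with NaN.
inner = pd.merge(l, r, on="key")
outer = pd.merge(l, r, on="key", how="outer")
print(len(inner), len(outer))  # 1 3
```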
Grouping
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
In [65]:
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df
Out[65]:
A B C D
Grouping and then applying the sum() function to the resulting groups:
In [66]:
df.groupby("A").sum()
Out[66]:
C D
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:
In [67]:
df.groupby(["A", "B"]).sum()
Out[67]:
C D
A B
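The split-apply-combine steps can be sketched deterministically (a toy frame, not the random one above):

```python
import pandas as pd

sales = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Split on "A", apply sum() per group, combine into a Series keyed by group.
totals = sales.groupby("A")["C"].sum()
print(totals["foo"], totals["bar"])  # 4.0 6.0
```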
Reshaping
Stack
In [68]:
tuples = list(
zip(
*[
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
)
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2 = df[:4]
df2
Out[68]:
A B
first second
The stack() method “compresses” a level in the DataFrame's columns:
In [69]:
stacked = df2.stack()
stacked
Out[69]:
first second
bar one A -1.796732
B 0.227643
two A -0.567493
B 0.489221
baz one A -0.629730
B -0.753370
two A 0.610537
B -1.857430
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last
level:
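The round trip can be sketched with a tiny deterministic frame (illustrative only):

```python
import pandas as pd

small = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# stack() moves the column labels into the innermost index level...
stacked = small.stack()
print(stacked[("x", "B")])  # 3

# ...and unstack() reverses it, so the round trip recovers the frame.
print(stacked.unstack().equals(small))  # True
```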
In [70]:
stacked.unstack()
Out[70]:
                     A         B
first second
bar   one    -1.796732  0.227643
      two    -0.567493  0.489221
baz   one    -0.629730 -0.753370
      two     0.610537 -1.857430
In [71]:
stacked.unstack(1)
Out[71]:
second        one       two
first
bar    A -1.796732 -0.567493
       B  0.227643  0.489221
baz    A -0.629730  0.610537
       B -0.753370 -1.857430
In [72]:
stacked.unstack(0)
Out[72]:
first          bar       baz
second
one    A -1.796732 -0.629730
       B  0.227643 -0.753370
two    A -0.567493  0.610537
       B  0.489221 -1.857430
Pivot tables
In [73]:
df = pd.DataFrame(
{
"A": ["one", "one", "two", "three"] * 3,
"B": ["A", "B", "C"] * 4,
"C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
"D": np.random.randn(12),
"E": np.random.randn(12),
}
)
df
Out[73]:
A B C D E
We can produce pivot tables from this data very easily:
In [74]:
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
Out[74]:
C             bar       foo
A     B
one   B -0.421078  1.549564
      C  1.450286 -0.555518
three B       NaN  0.055514
      C  0.057836       NaN
two   B -0.588536       NaN
      C       NaN  0.961319
Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data):
In [75]:
rng = pd.date_range("1/1/2012", periods=100, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample("5Min").sum()
Out[75]:
2012-01-01 25087
Freq: 5T, dtype: int32
Time zone representation:
In [76]:
rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
ts
Out[76]:
2012-03-06 -0.030217
2012-03-07 -0.988525
2012-03-08 -0.519949
2012-03-09 -0.102984
2012-03-10 0.412800
Freq: D, dtype: float64
In [77]:
ts_utc = ts.tz_localize("UTC")
ts_utc
Out[77]:
2012-03-06 00:00:00+00:00 -0.030217
2012-03-07 00:00:00+00:00 -0.988525
2012-03-08 00:00:00+00:00 -0.519949
2012-03-09 00:00:00+00:00 -0.102984
2012-03-10 00:00:00+00:00 0.412800
Freq: D, dtype: float64
Converting to another time zone:
In [78]:
ts_utc.tz_convert("US/Eastern")
Out[78]:
2012-03-05 19:00:00-05:00 -0.030217
2012-03-06 19:00:00-05:00 -0.988525
2012-03-07 19:00:00-05:00 -0.519949
2012-03-08 19:00:00-05:00 -0.102984
2012-03-09 19:00:00-05:00 0.412800
Freq: D, dtype: float64
Converting between time span representations:
In [79]:
rng = pd.date_range("1/1/2012", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
ts
Out[79]:
2012-01-31 -0.840996
2012-02-29 -0.792790
2012-03-31 -0.000829
2012-04-30 0.366409
2012-05-31 0.021293
Freq: M, dtype: float64
In [80]:
ps = ts.to_period()
ps
Out[80]:
2012-01 -0.840996
2012-02 -0.792790
2012-03 -0.000829
2012-04 0.366409
2012-05 0.021293
Freq: M, dtype: float64
In [81]:
ps.to_timestamp()
Out[81]:
2012-01-01 -0.840996
2012-02-01 -0.792790
2012-03-01 -0.000829
2012-04-01 0.366409
2012-05-01 0.021293
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly
frequency with year ending in November to 9am of the end of the month following the quarter end:
In [82]:
prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9
ts.head()
Out[82]:
1990-03-01 09:00 0.284210
1990-06-01 09:00 -0.380056
1990-09-01 09:00 -1.117618
1990-12-01 09:00 -0.011959
1991-03-01 09:00 1.153291
Freq: H, dtype: float64
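The month-after-quarter arithmetic can be checked on a single Period (a small sketch of the same trick):

```python
import pandas as pd

# A quarterly period whose fiscal year ends in November:
p = pd.Period("1990Q1", freq="Q-NOV")

# asfreq("M", "e") gives the quarter's last month (Feb 1990 here);
# adding 1 steps to the month following the quarter end.
month_after = p.asfreq("M", "e") + 1
print(month_after)  # 1990-03
```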
Categoricals
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.
In [83]:
df = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
In [84]:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
Out[84]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
Rename the categories to more meaningful names (assigning to Series.cat.categories is in place!):
In [85]:
df["grade"].cat.categories = ["very good", "good", "very bad"]
Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new Series by default):
In [86]:
df["grade"] = df["grade"].cat.set_categories(
["very bad", "bad", "medium", "good", "very good"]
)
df["grade"]
Out[86]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
Sorting is per order in the categories, not lexical order:
In [87]:
df.sort_values(by="grade")
Out[87]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
Grouping by a categorical column also shows empty categories:
In [88]:
df.groupby("grade").size()
Out[88]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
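The category-order sorting can be sketched with a tiny ordered categorical (illustrative data):

```python
import pandas as pd

grades = pd.Series(["b", "a", "c"]).astype(
    pd.CategoricalDtype(categories=["c", "b", "a"], ordered=True)
)

# sort_values() follows the declared category order, not lexical order.
print(grades.sort_values().tolist())  # ['c', 'b', 'a']
```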
Plotting from pandas data
We use the standard matplotlib convention for referencing the plotting API:
In [89]:
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
ts.plot();
plt.show();
On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
In [90]:
df = pd.DataFrame(
np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
)
df = df.cumsum()
plt.figure();
df.plot();
plt.legend(loc='best');
<Figure size 432x288 with 0 Axes>
Getting data in/out
CSV
Writing to a csv file:
In [91]:
df.to_csv("foo.csv")
Reading from a csv file:
In [92]:
pd.read_csv("foo.csv")
Out[92]:
Unnamed: 0 A B C D
HDF5
Writing to a HDF5 store:
In [93]:
df.to_hdf("foo.h5", "df")
Reading from a HDF5 store:
In [94]:
pd.read_hdf("foo.h5", "df")
Out[94]:
A B C D
Excel
Writing to an excel file:
In [95]:
df.to_excel("foo.xlsx", sheet_name="Sheet1")
Reading from an excel file:
In [96]:
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
Out[96]:
Unnamed: 0 A B C D