
10 minutes to pandas

In [2]:

import numpy as np
import pandas as pd

Object creation
Creating a Series by passing a list of values, letting pandas create a default integer index:
In [3]:

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Out[3]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
In [4]:

dates = pd.date_range("20130101", periods=6)
print(dates)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
A B C D
2013-01-01 -0.420668 0.316927 1.194467 -0.180756
2013-01-02 0.795189 -0.012337 1.629099 2.342658
2013-01-03 -1.976610 -0.321517 0.548684 0.016748
2013-01-04 1.119085 -1.300419 -0.619006 1.949738
2013-01-05 -0.351920 -2.822457 1.027199 0.879863
2013-01-06 0.339076 -2.810959 -0.367116 -0.950692
Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:
In [5]:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

Out[5]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

In [6]:

df2.dtypes

Out[6]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Viewing data
Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:
In [7]:
print(df.head())
print(df.tail())

A B C D
2013-01-01 -0.420668 0.316927 1.194467 -0.180756
2013-01-02 0.795189 -0.012337 1.629099 2.342658
2013-01-03 -1.976610 -0.321517 0.548684 0.016748
2013-01-04 1.119085 -1.300419 -0.619006 1.949738
2013-01-05 -0.351920 -2.822457 1.027199 0.879863
A B C D
2013-01-02 0.795189 -0.012337 1.629099 2.342658
2013-01-03 -1.976610 -0.321517 0.548684 0.016748
2013-01-04 1.119085 -1.300419 -0.619006 1.949738
2013-01-05 -0.351920 -2.822457 1.027199 0.879863
2013-01-06 0.339076 -2.810959 -0.367116 -0.950692
Display the index and columns:
In [8]:

df.index

Out[8]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns similarly returns the column labels, Index(['A', 'B', 'C', 'D'], dtype='object').
DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call
DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires
casting every value to a Python object.
In [9]:

df.to_numpy()

Out[9]:
array([[-0.42066784, 0.3169275 , 1.19446662, -0.18075587],
[ 0.79518934, -0.01233682, 1.62909892, 2.34265771],
[-1.97661037, -0.32151732, 0.54868407, 0.01674812],
[ 1.11908538, -1.30041906, -0.61900632, 1.94973826],
[-0.35191989, -2.82245689, 1.02719947, 0.87986305],
[ 0.33907597, -2.81095942, -0.36711631, -0.95069173]])
In [10]:

print(type(df.to_numpy()))

<class 'numpy.ndarray'>
In [11]:

df.dtypes

Out[11]:
A float64
B float64
C float64
D float64
dtype: object
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive:
In [12]:

df2.to_numpy()

Out[12]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
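As a quick check (a small sketch, not part of the original text), the promoted dtype is visible on the returned array:

df.to_numpy().dtype   # dtype('float64') -- homogeneous float columns
df2.to_numpy().dtype  # dtype('O') -- mixed column types fall back to object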
DataFrame.to_numpy() does not include the index or column labels in the output.
describe() shows a quick statistic summary of your data:
In [13]:

df.describe()
Out[13]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.082641 -1.158460 0.568888 0.676260
std 1.110538 1.393602 0.895885 1.285028
min -1.976610 -2.822457 -0.619006 -0.950692
25% -0.403481 -2.433324 -0.138166 -0.131380
50% -0.006422 -0.810968 0.787942 0.448306
75% 0.681161 -0.089632 1.152650 1.682269
max 1.119085 0.316927 1.629099 2.342658

Transposing your data:
In [14]:

df.T

Out[14]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A -0.420668 0.795189 -1.976610 1.119085 -0.351920 0.339076
B 0.316927 -0.012337 -0.321517 -1.300419 -2.822457 -2.810959
C 1.194467 1.629099 0.548684 -0.619006 1.027199 -0.367116
D -0.180756 2.342658 0.016748 1.949738 0.879863 -0.950692

Sorting by an axis:
In [15]:

df.sort_index(axis=1, ascending=False)

Out[15]:
D C B A
2013-01-01 -0.180756 1.194467 0.316927 -0.420668
2013-01-02 2.342658 1.629099 -0.012337 0.795189
2013-01-03 0.016748 0.548684 -0.321517 -1.976610
2013-01-04 1.949738 -0.619006 -1.300419 1.119085
2013-01-05 0.879863 1.027199 -2.822457 -0.351920
2013-01-06 -0.950692 -0.367116 -2.810959 0.339076

Sorting by values:
In [16]:

df.sort_values(by="B")

Out[16]:
A B C D
2013-01-05 -0.351920 -2.822457 1.027199 0.879863
2013-01-06 0.339076 -2.810959 -0.367116 -0.950692
2013-01-04 1.119085 -1.300419 -0.619006 1.949738
2013-01-03 -1.976610 -0.321517 0.548684 0.016748
2013-01-02 0.795189 -0.012337 1.629099 2.342658
2013-01-01 -0.420668 0.316927 1.194467 -0.180756

Selection

While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we
recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.
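As a brief illustration of why this matters (a sketch using the df and dates from above, not from the original text): a single .loc call resolves the row and column together, while chained indexing may write to a temporary copy.

df_demo = df.copy()
df_demo.loc[dates[0], "A"] = 1.5   # recommended: one indexing operation
# df_demo["A"][dates[0]] = 1.5     # discouraged: chained indexing can hit a copy
#                                  # and raise SettingWithCopyWarning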
Selecting a single column, which yields a Series, equivalent to df.A:
In [17]:

df["A"]
Out[17]:
2013-01-01 -0.420668
2013-01-02 0.795189
2013-01-03 -1.976610
2013-01-04 1.119085
2013-01-05 -0.351920
2013-01-06 0.339076
Freq: D, Name: A, dtype: float64
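The attribute form is convenient but only works when the column name is a valid Python identifier and does not collide with a DataFrame method (a short sketch):

df.A   # same Series as df["A"]
# A column named "my col" or "mean" can only be reached with bracket access.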
Selecting via [], which slices the rows. Both slices below select rows; only the last expression's output is echoed:
In [18]:

df[0:3]

df["20130102":"20130104"]

Out[18]:
A B C D
2013-01-02 0.795189 -0.012337 1.629099 2.342658
2013-01-03 -1.976610 -0.321517 0.548684 0.016748
2013-01-04 1.119085 -1.300419 -0.619006 1.949738

Selection by label
For getting a cross section using a label:
In [19]:

df.loc[dates[0]]

Out[19]:
A -0.420668
B 0.316927
C 1.194467
D -0.180756
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
In [20]:

df.loc[:, ["A", "B"]]

Out[20]:
A B
2013-01-01 -0.420668 0.316927
2013-01-02 0.795189 -0.012337
2013-01-03 -1.976610 -0.321517
2013-01-04 1.119085 -1.300419
2013-01-05 -0.351920 -2.822457
2013-01-06 0.339076 -2.810959

Showing label slicing, both endpoints are included:


In [21]:

df.loc["20130102":"20130104", ["A", "B"]]

Out[21]:
A B
2013-01-02 0.795189 -0.012337
2013-01-03 -1.976610 -0.321517
2013-01-04 1.119085 -1.300419
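For contrast, positional slicing with .iloc excludes the stop position (a quick sketch on the same df):

df.loc["20130102":"20130104", ["A", "B"]]   # label slice: both endpoints included
df.iloc[1:4, 0:2]                           # same three rows by position: stop 4 excluded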

Reduction in the dimensions of the returned object:


In [22]:

df.loc["20130102", ["A", "B"]]

Out[22]:
A 0.795189
B -0.012337
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value:
In [23]:

df.loc[dates[0], "A"]
Out[23]:
-0.4206678363713324
In [24]:

df.at[dates[0], "A"]

Out[24]:
-0.4206678363713324
Selection by position
Selecting via the position of the passed integers:
In [25]:

df.iloc[3]

Out[25]:
A 1.119085
B -1.300419
C -0.619006
D 1.949738
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similar to NumPy/Python:
In [26]:

df.iloc[3:5, 0:2]

Out[26]:
A B
2013-01-04 1.119085 -1.300419
2013-01-05 -0.351920 -2.822457

By lists of integer position locations, similar to the NumPy/Python style:
In [27]:

df.iloc[[1, 2, 4], [0, 2]]

Out[27]:
A C
2013-01-02 0.795189 1.629099
2013-01-03 -1.976610 0.548684
2013-01-05 -0.351920 1.027199

For slicing rows explicitly:


In [28]:

df.iloc[1:3, :]

Out[28]:
A B C D
2013-01-02 0.795189 -0.012337 1.629099 2.342658
2013-01-03 -1.976610 -0.321517 0.548684 0.016748

For slicing columns explicitly:


In [29]:

df.iloc[:, 1:3]

Out[29]:
B C
2013-01-01 0.316927 1.194467
2013-01-02 -0.012337 1.629099
2013-01-03 -0.321517 0.548684
2013-01-04 -1.300419 -0.619006
2013-01-05 -2.822457 1.027199
2013-01-06 -2.810959 -0.367116

For getting a value explicitly:


In [30]:

df.iloc[1, 1]

Out[30]:
-0.012336819546293868
In [31]:

df.iat[1, 1]
Out[31]:
-0.012336819546293868
Boolean indexing
Using a single column's values to select data:
In [32]:

df[df["A"] > 0]

Out[32]:
A B C D
2013-01-02 0.795189 -0.012337 1.629099 2.342658
2013-01-04 1.119085 -1.300419 -0.619006 1.949738
2013-01-06 0.339076 -2.810959 -0.367116 -0.950692

Selecting values from a DataFrame where a boolean condition is met:


In [33]:

df[df > 0]

Out[33]:
A B C D
2013-01-01 NaN 0.316927 1.194467 NaN
2013-01-02 0.795189 NaN 1.629099 2.342658
2013-01-03 NaN NaN 0.548684 0.016748
2013-01-04 1.119085 NaN NaN 1.949738
2013-01-05 NaN NaN 1.027199 0.879863
2013-01-06 0.339076 NaN NaN NaN
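To combine conditions, use the elementwise operators & and | with parentheses; the Python keywords and/or raise an error on a Series (a small sketch on the same df):

df[(df["A"] > 0) & (df["B"] < 0)]   # rows where A is positive and B is negative
df[(df["A"] > 0) | (df["D"] > 1)]   # rows satisfying either condition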

Using the isin() method for filtering:


In [34]:

df2 = df.copy()

In [35]:

df2["E"] = ["one", "one", "two", "three", "four", "three"]

In [36]:

df2

Out[36]:
A B C D E
2013-01-01 -0.420668 0.316927 1.194467 -0.180756 one
2013-01-02 0.795189 -0.012337 1.629099 2.342658 one
2013-01-03 -1.976610 -0.321517 0.548684 0.016748 two
2013-01-04 1.119085 -1.300419 -0.619006 1.949738 three
2013-01-05 -0.351920 -2.822457 1.027199 0.879863 four
2013-01-06 0.339076 -2.810959 -0.367116 -0.950692 three

In [37]:

df2[df2["E"].isin(["two", "four"])]

Out[37]:
A B C D E
2013-01-03 -1.97661 -0.321517 0.548684 0.016748 two
2013-01-05 -0.35192 -2.822457 1.027199 0.879863 four

Setting

Setting a new column automatically aligns the data by the indexes:


In [38]:

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

In [39]:

s1
Out[39]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
In [40]:

df["F"] = s1

Setting values by label:


In [41]:

df.at[dates[0], "A"] = 0

Setting values by position:


In [42]:

df.iat[0, 1] = 0

Setting by assigning with a NumPy array:


In [43]:

df.loc[:, "D"] = np.array([5] * len(df))

The result of the prior setting operations:


In [44]:

df

Out[44]:
A B C D F
2013-01-01 0.000000 0.000000 1.194467 5 NaN
2013-01-02 0.795189 -0.012337 1.629099 5 1.0
2013-01-03 -1.976610 -0.321517 0.548684 5 2.0
2013-01-04 1.119085 -1.300419 -0.619006 5 3.0
2013-01-05 -0.351920 -2.822457 1.027199 5 4.0
2013-01-06 0.339076 -2.810959 -0.367116 5 5.0

A where operation with setting:


In [45]:

df2 = df.copy()

df2[df2 > 0] = -df2

df2

Out[45]:
A B C D F
2013-01-01 0.000000 0.000000 -1.194467 -5 NaN
2013-01-02 -0.795189 -0.012337 -1.629099 -5 -1.0
2013-01-03 -1.976610 -0.321517 -0.548684 -5 -2.0
2013-01-04 -1.119085 -1.300419 -0.619006 -5 -3.0
2013-01-05 -0.351920 -2.822457 -1.027199 -5 -4.0
2013-01-06 -0.339076 -2.810959 -0.367116 -5 -5.0

Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
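For instance (a minimal sketch, not from the original), reductions skip NaN unless told otherwise:

pd.Series([1.0, np.nan, 3.0]).mean()              # 2.0 -- NaN is excluded
pd.Series([1.0, np.nan, 3.0]).mean(skipna=False)  # nan -- NaN propagates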

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
In [46]:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

df1.loc[dates[0] : dates[1], "E"] = 1

df1
Out[46]:
A B C D F E
2013-01-01 0.000000 0.000000 1.194467 5 NaN 1.0
2013-01-02 0.795189 -0.012337 1.629099 5 1.0 1.0
2013-01-03 -1.976610 -0.321517 0.548684 5 2.0 NaN
2013-01-04 1.119085 -1.300419 -0.619006 5 3.0 NaN

To drop any rows that have missing data:


In [47]:

df1.dropna(how="any")

Out[47]:
A B C D F E
2013-01-02 0.795189 -0.012337 1.629099 5 1.0 1.0

Filling missing data:


In [48]:

df1.fillna(value=5)

Out[48]:
A B C D F E
2013-01-01 0.000000 0.000000 1.194467 5 5.0 1.0
2013-01-02 0.795189 -0.012337 1.629099 5 1.0 1.0
2013-01-03 -1.976610 -0.321517 0.548684 5 2.0 5.0
2013-01-04 1.119085 -1.300419 -0.619006 5 3.0 5.0

To get the boolean mask where values are nan:


In [49]:

pd.isna(df1)

Out[49]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True

Statistics
Operations in general exclude missing data. Performing a descriptive statistic:
In [50]:

df.mean()

Out[50]:
A -0.012530
B -1.211282
C 0.568888
D 5.000000
F 3.000000
dtype: float64
Same operation on the other axis:
In [51]:

df.mean(axis=1)

Out[51]:
2013-01-01 1.548617
2013-01-02 1.682390
2013-01-03 1.050111
2013-01-04 1.439932
2013-01-05 1.370565
2013-01-06 1.432200
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment; pandas automatically broadcasts along the specified dimension:
In [52]:

pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)


Out[52]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
In [53]:

s

Out[53]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Subtracting a Series with df.sub() aligns on the index first. Here s still carries the default integer index from In [3], while df has a datetime index, so no labels match and every aligned value is NaN:
In [54]:

df.sub(s, axis="index")

Out[54]:
A B C D F
2013-01-01 00:00:00 NaN NaN NaN NaN NaN
2013-01-02 00:00:00 NaN NaN NaN NaN NaN
2013-01-03 00:00:00 NaN NaN NaN NaN NaN
2013-01-04 00:00:00 NaN NaN NaN NaN NaN
2013-01-05 00:00:00 NaN NaN NaN NaN NaN
2013-01-06 00:00:00 NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN

Apply

Applying functions to the data:


In [55]:

df.apply(np.cumsum)

Out[55]:
A B C D F
2013-01-01 0.000000 0.000000 1.194467 5 NaN
2013-01-02 0.795189 -0.012337 2.823566 10 1.0
2013-01-03 -1.181421 -0.333854 3.372250 15 3.0
2013-01-04 -0.062336 -1.634273 2.753243 20 6.0
2013-01-05 -0.414256 -4.456730 3.780443 25 10.0
2013-01-06 -0.075180 -7.267690 3.413326 30 15.0

In [56]:

df.apply(lambda x: x.max() - x.min())

Out[56]:
A 3.095696
B 2.822457
C 2.248105
D 0.000000
F 4.000000
dtype: float64
Histogramming
In [57]:

s = pd.Series(np.random.randint(0, 7, size=10))
s
Out[57]:
0 3
1 4
2 2
3 5
4 5
5 0
6 1
7 4
8 6
9 0
dtype: int32
In [58]:

s.value_counts()

Out[58]:
4 2
5 2
0 2
3 1
2 1
1 1
6 1
dtype: int64
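Relatedly (a sketch not in the original), pd.cut() buckets values into intervals before counting, which gives a histogram-style summary of continuous data:

pd.cut(s, bins=3).value_counts()   # counts per equal-width interval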
String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code
snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at
Vectorized String Methods.
In [59]:

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

s.str.lower()

Out[59]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
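For example (a short sketch, assuming a throwaway series), .str.contains() interprets its pattern as a regular expression by default, and .str.replace() takes a regex flag:

s2 = pd.Series(["cat", "dog", "bird"])
s2.str.contains("^c.t$")                    # regex match -> True, False, False
s2.str.replace("d.g", "puppy", regex=True)  # regex substitution -> cat, puppy, bird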
Merge

Concat
pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality for join / merge-type operations.
Concatenating pandas objects together with concat():
In [60]:

df = pd.DataFrame(np.random.randn(10, 4))

df

Out[60]:
0 1 2 3
0 -1.222458 -0.545523 0.945071 -0.064294
1 1.067093 -0.205692 -1.292987 -0.557581
2 -0.750368 -0.968887 -1.056839 -0.006117
3 0.735581 2.073268 -1.470256 -1.164244
4 0.829276 -0.450463 0.378255 -0.953423
5 0.006803 -1.675964 1.571826 0.713458
6 -0.473197 -0.835939 -1.590717 -1.186727
7 2.363942 -0.873282 0.069140 0.487221
8 0.133559 0.098014 0.196791 1.068511
9 -0.986957 0.610417 1.406884 -0.407003

In [61]:

# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]

In [62]:
pd.concat(pieces)

Out[62]:
0 1 2 3
0 -1.222458 -0.545523 0.945071 -0.064294
1 1.067093 -0.205692 -1.292987 -0.557581
2 -0.750368 -0.968887 -1.056839 -0.006117
3 0.735581 2.073268 -1.470256 -1.164244
4 0.829276 -0.450463 0.378255 -0.953423
5 0.006803 -1.675964 1.571826 0.713458
6 -0.473197 -0.835939 -1.590717 -1.186727
7 2.363942 -0.873282 0.069140 0.487221
8 0.133559 0.098014 0.196791 1.068511
9 -0.986957 0.610417 1.406884 -0.407003
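A practical note (a sketch, not from the original): building a DataFrame row by row with repeated concat calls is quadratic; collect the pieces in a list first and call concat() once:

rows = [pd.DataFrame({"x": [i], "y": [i * 2]}) for i in range(3)]
pd.concat(rows, ignore_index=True)   # one concatenation, fresh 0..n-1 index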

Join

SQL style merges


In [63]:

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

print(left)

print(right)

pd.merge(left, right, on="key")

key lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 foo 5
Out[63]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

Because every row of both frames carries the key "foo", the merge returns their Cartesian product. With unique keys, each row pairs exactly once:
In [64]:

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

print(left)
print(right)

pd.merge(left, right, on="key")

key lval
0 foo 1
1 bar 2
key rval
0 foo 4
1 bar 5
Out[64]:
key lval rval
0 foo 1 4
1 bar 2 5

Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure


In [65]:

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

df

Out[65]:
A B C D
0 foo one -1.502476 -1.053789
1 bar one -1.646625 -1.614114
2 foo two 0.429229 0.624235
3 bar three 0.818017 0.132200
4 foo two 0.347825 2.179355
5 bar two -0.066571 -2.371830
6 foo one 0.720629 0.134159
7 foo three 0.141101 -0.442835

Grouping and then applying the sum() function to the resulting groups:
In [66]:

df.groupby("A").sum()

Out[66]:
C D
bar -0.895180 -3.853745
foo 0.136309 1.441124

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:
In [67]:

df.groupby(["A", "B"]).sum()

Out[67]:
C D
A B
bar one -1.646625 -1.614114
three 0.818017 0.132200
two -0.066571 -2.371830
foo one -0.781846 -0.919630
three 0.141101 -0.442835
two 0.777054 2.803589
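Beyond sum(), other reductions and per-column aggregations work the same way (a sketch, not part of the original):

df.groupby("A")["C"].mean()                            # one reduction on one column
df.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"})  # different reduction per column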

Reshaping
Stack
In [68]:

# zip(*...) pairs the two label lists element-wise:
# ('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ...
tuples = list(
    zip(
        *[
            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

df2 = df[:4]

df2

Out[68]:
A B
first second
bar one -1.796732 0.227643
two -0.567493 0.489221
baz one -0.629730 -0.753370
two 0.610537 -1.857430

The stack() method “compresses” a level in the DataFrame’s columns:


In [69]:

stacked = df2.stack()
stacked

Out[69]:
first second
bar one A -1.796732
B 0.227643
two A -0.567493
B 0.489221
baz one A -0.629730
B -0.753370
two A 0.610537
B -1.857430
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last
level:
In [70]:

stacked.unstack()

Out[70]:
A B
first second
bar one -1.796732 0.227643
two -0.567493 0.489221
baz one -0.629730 -0.753370
two 0.610537 -1.857430

In [71]:

stacked.unstack(1)

Out[71]:
second one two
first
bar A -1.796732 -0.567493
B 0.227643 0.489221
baz A -0.629730 0.610537
B -0.753370 -1.857430

In [72]:

stacked.unstack(0)
Out[72]:
first bar baz
second
one A -1.796732 -0.629730
B 0.227643 -0.753370
two A -0.567493 0.610537
B 0.489221 -1.857430

Pivot tables
In [73]:

df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["A", "B", "C"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)

df

Out[73]:
A B C D E
0 one A foo 0.206741 0.188787
1 one B foo 1.549564 -2.062565
2 two C foo 0.961319 -0.898992
3 three A bar 0.893095 -0.853662
4 one B bar -0.421078 -0.615559
5 one C bar 1.450286 -1.161786
6 two A foo -1.113889 -1.422208
7 three B foo 0.055514 -1.526936
8 one C foo -0.555518 -2.069541
9 one A bar 0.009159 0.632686
10 two B bar -0.588536 0.604000
11 three C bar 0.057836 -1.128347

We can produce pivot tables from this data very easily:


In [74]:

pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

Out[74]:
C bar foo
A B
one A 0.009159 0.206741
B -0.421078 1.549564
C 1.450286 -0.555518
three A 0.893095 NaN
B NaN 0.055514
C 0.057836 NaN
two A NaN -1.113889
B -0.588536 NaN
C NaN 0.961319

Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data):
In [75]:

rng = pd.date_range("1/1/2012", periods=100, freq="S")

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

ts.resample("5Min").sum()
Out[75]:
2012-01-01 25087
Freq: 5T, dtype: int32
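resample() behaves like a time-based groupby, so other reductions drop in directly (a brief sketch on the same ts):

ts.resample("5Min").mean()   # average instead of sum per 5-minute bin
ts.resample("1Min").ohlc()   # open/high/low/close summary per bin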
Time zone representation:
In [76]:

rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")

ts = pd.Series(np.random.randn(len(rng)), rng)

ts

Out[76]:
2012-03-06 -0.030217
2012-03-07 -0.988525
2012-03-08 -0.519949
2012-03-09 -0.102984
2012-03-10 0.412800
Freq: D, dtype: float64
In [77]:

ts_utc = ts.tz_localize("UTC")

ts_utc

Out[77]:
2012-03-06 00:00:00+00:00 -0.030217
2012-03-07 00:00:00+00:00 -0.988525
2012-03-08 00:00:00+00:00 -0.519949
2012-03-09 00:00:00+00:00 -0.102984
2012-03-10 00:00:00+00:00 0.412800
Freq: D, dtype: float64
Converting to another time zone:
In [78]:

ts_utc.tz_convert("US/Eastern")

Out[78]:
2012-03-05 19:00:00-05:00 -0.030217
2012-03-06 19:00:00-05:00 -0.988525
2012-03-07 19:00:00-05:00 -0.519949
2012-03-08 19:00:00-05:00 -0.102984
2012-03-09 19:00:00-05:00 0.412800
Freq: D, dtype: float64
Converting between time span representations:
In [79]:

rng = pd.date_range("1/1/2012", periods=5, freq="M")

ts = pd.Series(np.random.randn(len(rng)), index=rng)

ts

Out[79]:
2012-01-31 -0.840996
2012-02-29 -0.792790
2012-03-31 -0.000829
2012-04-30 0.366409
2012-05-31 0.021293
Freq: M, dtype: float64
In [80]:

ps = ts.to_period()

ps

Out[80]:
2012-01 -0.840996
2012-02 -0.792790
2012-03 -0.000829
2012-04 0.366409
2012-05 0.021293
Freq: M, dtype: float64
In [81]:

ps.to_timestamp()
Out[81]:
2012-01-01 -0.840996
2012-02-01 -0.792790
2012-03-01 -0.000829
2012-04-01 0.366409
2012-05-01 0.021293
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly
frequency with year ending in November to 9am of the end of the month following the quarter end:
In [82]:

prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")

ts = pd.Series(np.random.randn(len(prng)), prng)

ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9

ts.head()

Out[82]:
1990-03-01 09:00 0.284210
1990-06-01 09:00 -0.380056
1990-09-01 09:00 -1.117618
1990-12-01 09:00 -0.011959
1991-03-01 09:00 1.153291
Freq: H, dtype: float64
Categoricals
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.
In [83]:

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)

Converting the raw grades to a categorical data type:
In [84]:

df["grade"] = df["raw_grade"].astype("category")

df["grade"]

Out[84]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
Rename the categories to more meaningful names. In recent pandas, use Series.cat.rename_categories(); the older idiom of assigning to Series.cat.categories in place has been deprecated and removed:
In [85]:

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new Series by default):
In [86]:

df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

df["grade"]

Out[86]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
Sorting is per order in the categories, not lexical order:
In [87]:

df.sort_values(by="grade")
Out[87]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

Grouping by a categorical column also shows empty categories:
In [88]:

df.groupby("grade").size()

Out[88]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
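Whether empty categories appear in the result is controlled by the observed keyword in recent pandas releases (a hedged sketch; the default has shifted across versions):

df.groupby("grade", observed=False).size()  # keep empty categories, as above
df.groupby("grade", observed=True).size()   # drop categories with no rows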
Plotting from pandas data
We use the standard convention for referencing the matplotlib API:
In [89]:

import matplotlib.pyplot as plt


plt.close("all")

ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))

ts = ts.cumsum()

ts.plot();

plt.show();

In [90]:

df = pd.DataFrame(
    np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
)

df = df.cumsum()

plt.figure();

df.plot();

plt.legend(loc='best');
<Figure size 432x288 with 0 Axes>

Getting data in/out
Writing to a csv file:
In [91]:

df.to_csv("foo.csv")

Reading from a csv file:


In [92]:

pd.read_csv("foo.csv")

Out[92]:
Unnamed: 0 A B C D
0 2000-01-01 -1.109880 1.740198 0.560462 -0.778221
1 2000-01-02 -0.789209 2.199018 1.626010 -1.086583
2 2000-01-03 -0.655827 3.834424 2.362702 0.197711
3 2000-01-04 -1.490387 4.339847 0.843494 -0.220284
4 2000-01-05 -2.311777 4.432653 -0.150564 -1.236404
... ... ... ... ... ...
995 2002-09-22 18.024293 -4.195810 5.507859 1.509948
996 2002-09-23 18.197178 -2.839899 5.018599 1.588831
997 2002-09-24 18.794616 -4.743130 6.943454 1.508593
998 2002-09-25 19.424756 -4.544472 5.574543 2.531804
999 2002-09-26 21.661008 -4.871010 5.199765 3.945830
1000 rows × 5 columns

HDF5
Reading and writing to HDFStores (HDF5 support requires the optional PyTables package).

Writing to a HDF5 Store:
In [93]:

df.to_hdf("foo.h5", "df")

In [94]:

pd.read_hdf("foo.h5", "df")
Out[94]:
A B C D
2000-01-01 -1.109880 1.740198 0.560462 -0.778221
2000-01-02 -0.789209 2.199018 1.626010 -1.086583
2000-01-03 -0.655827 3.834424 2.362702 0.197711
2000-01-04 -1.490387 4.339847 0.843494 -0.220284
2000-01-05 -2.311777 4.432653 -0.150564 -1.236404
... ... ... ... ...
2002-09-22 18.024293 -4.195810 5.507859 1.509948
2002-09-23 18.197178 -2.839899 5.018599 1.588831
2002-09-24 18.794616 -4.743130 6.943454 1.508593
2002-09-25 19.424756 -4.544472 5.574543 2.531804
2002-09-26 21.661008 -4.871010 5.199765 3.945830
1000 rows × 4 columns

Excel
Reading and writing to MS Excel (requires an Excel engine such as openpyxl).


In [95]:

df.to_excel("foo.xlsx", sheet_name="Sheet1")

In [96]:

pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])

Out[96]:
Unnamed: 0 A B C D
0 2000-01-01 -1.109880 1.740198 0.560462 -0.778221
1 2000-01-02 -0.789209 2.199018 1.626010 -1.086583
2 2000-01-03 -0.655827 3.834424 2.362702 0.197711
3 2000-01-04 -1.490387 4.339847 0.843494 -0.220284
4 2000-01-05 -2.311777 4.432653 -0.150564 -1.236404
... ... ... ... ... ...
995 2002-09-22 18.024293 -4.195810 5.507859 1.509948
996 2002-09-23 18.197178 -2.839899 5.018599 1.588831
997 2002-09-24 18.794616 -4.743130 6.943454 1.508593
998 2002-09-25 19.424756 -4.544472 5.574543 2.531804
999 2002-09-26 21.661008 -4.871010 5.199765 3.945830
1000 rows × 5 columns
