Case 1 Blog

case 1 blog 4/1/21, 11:11 PM
What's our fair?

First, let's take a good look at our data.
In [1]: import numpy as np

import pandas as pd
from matplotlib import pyplot as plt
In [2]: plt.rcParams['figure.figsize'] = (8.0, 6.0)

plt.rc('font', size=16)
We have 10 years of historical data from 2010-2019 of electricity consumption and prices
In [3]: !ls case1/DATA
2010.csv 2012.csv 2014.csv 2016.csv 2018.csv

2011.csv 2013.csv 2015.csv 2017.csv 2019.csv
In [4]: data = [pd.read_csv(f'case1/DATA/{year}.csv', header=0, names=['date',

'cons', 'p']) for year in range(2010, 2020)]
In [5]: data[0].head()
Out[5]:
date cons p
0 2010-01-01 0.000000 0.0
1 2010-01-02 0.447285 2.0
2 2010-01-03 14.339777 2.0
3 2010-01-04 17.388527 2.0
4 2010-01-05 28.372350 2.0
https://tianyi.io/chicago1nb.html Page 1 of 23
case 1 blog 4/1/21, 11:11 PM
In [86]: plt.xlabel("Day")
plt.ylabel("Cumulative Consumption (KWh)")
plt.title("Consumption/day in 2010")
plt.plot(data[0].cons)
plt.show()
Price/day in 2010
The 4 contract expiry dates are plotted as vertical red lines.
case 1 blog 4/1/21, 11:11 PM
In [85]: plt.xlabel("Day")
plt.ylabel("Price ($)")
plt.ylim(0, 4.8)
plt.axvline(180, c="red")
plt.plot(data[0].p[1:])
plt.show()
case 1 blog 4/1/21, 11:11 PM
In [8]: plt.title("Daily consumption in 2010")

plt.xlabel("Daily consumption (KWh)")
plt.ylabel("Frequency")
plt.hist(data[0].cons[1:].values - data[0].cons[:-1].values, 50)
plt.show()
In [9]: # Convert cumulative consumption to daily

for d in data:
d['dailycons'] = d.cons - d.cons.shift(1)
Gas and coal supplies

All 10 years have very similar consumption/ production patterns, so we should be able to model the
electricity price accurately.
We will investigate the pattern of gas/coal consumption over the 10 years.
case 1 blog 4/1/21, 11:11 PM
In [10]: gas = []
for year in data:
row = year[(year.p < 3) & (year.p > 2)]
gas.append((row.cons - (3 - row.p) * row.dailycons).values[0])
plt.title("Gas capacity/Year")
plt.xlabel("Year")
plt.ylabel("Gas capacity (KWh)")
plt.plot(range(2010, 2020), gas)
plt.show()
case 1 blog 4/1/21, 11:11 PM
In [11]: coal = []
for i, year in enumerate(data):
row = year[(year.p < 3.01) & (year.p > 3)].iloc[0]
c = row.cons - (row.p - 3) * 1000 - gas[i]
coal.append(c)
plt.title("Coal capacity/Year")
plt.xlabel("Year")
plt.ylabel("Coal capacity (KWh)")
plt.plot(range(2010, 2020), coal)
plt.show()
It looks like there is a linear growth in natural gas and coal capacity over the years. Perhaps we can model
them with linear models.
In [12]: from sklearn.linear_model import LinearRegression
case 1 blog 4/1/21, 11:11 PM
In [90]: gasreg = LinearRegression().fit(np.arange(10).reshape(-1, 1), gas)

print("Gas production")
print("m:", gasreg.coef_[0], "c:", gasreg.intercept_)
print("2020 mean:", gasreg.predict([[10]])[0])
print("Unbiased estimate of variance:", np.var(gasreg.predict(np.arang
e(10).reshape(-1, 1)) - gas) * 10/9)
Gas production
m: 13.419965801206784 c: 1297.330695128393
2020 mean: 1431.5303531404609
Unbiased estimate of variance: 45.99518922123725
In [91]: coalreg = LinearRegression().fit(np.arange(10).reshape(-1, 1), coal)

print("Coal production")
print("m:", coalreg.coef_[0], "c:", coalreg.intercept_)
print("2020 mean:", coalreg.predict([[10]])[0])
print("Unbiased estimate of variance:", np.var(coalreg.predict(np.aran
ge(10).reshape(-1, 1)) - coal) * 10/9)
Coal production
m: 3.0905406916305362 c: 461.8418539140977
2020 mean: 492.7472608304031
Unbiased estimate of variance: 57.38547189035349
Modelling Consumption
Once we have a good idea of gas/coal capacity, we have to investigate consumption patterns. Specifically,
we want to answer the following questions:
What's the underlying distribution of consumption?

Is the consumption distribution constant or time-varying? (Periodic?)
Are there any autocorrelations in the consumption? (Perhaps, high consumption predicts high
consumption the next day?)
In [15]: # Combining consumption over 10 years

cons = pd.concat([year.dailycons[1:] for year in data]).reset_index().
drop('index', axis=1).values.T[0]
case 1 blog 4/1/21, 11:11 PM
In [16]: plt.title("Consumption/Day")
plt.xlabel("Day")
plt.ylabel("Consumption (KWh)")
plt.plot(cons)
plt.show()
case 1 blog 4/1/21, 11:11 PM
In [17]: plt.title("Histogram of consumption")

plt.xlabel("Consumption (KWh)")
plt.hist(cons, 100)
plt.show()
Hmm, does this look like an exponential distribution?
The exponential distribution has PDF of 𝜆𝑒−𝜆𝑥 , with mean of 1𝜆
In [18]: lmb = 1 / np.mean(cons)

print(lmb)
0.1252364948706893
case 1 blog 4/1/21, 11:11 PM
In [19]: plt.title("Histogram of consumption")

plt.xlabel("Consumption (KWh)")
plt.hist(cons, range(0, 72))
plt.plot(np.linspace(0, 70, 100), len(cons) * lmb * np.exp(-lmb * np.l
inspace(0, 70, 100)), label="Exp(lmb)")
plt.legend(loc="upper right")
plt.show()
In [60]: print("Sample variance:", np.var(cons))

print("Exp distribution variance:", 1/lmb**2)
Sample variance: 66.954336216657

Exp distribution variance: 63.758514788476
We have a strong suspicion that the underlying distribution is exponential. But we still have to confirm that
there aren't any periodic patterns in the data.
case 1 blog 4/1/21, 11:11 PM
Autocorrlelation between consecutive days.
The low correlation suggests lack of relationship between consecutive days.
In [20]: np.corrcoef(cons[1:], cons[:-1])
Out[20]: array([[ 1. , -0.02293147],

[-0.02293147, 1. ]])
Exponentially weighted moving averages

(https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average) of consumption.
There doesn't seem to be a clear periodic trend in consumptions
In [31]: plt.title("EWMA of consumption, alpha = .01")

plt.xlabel("Day")
plt.ylabel("EWMA(cons)")
plt.plot(pd.Series(cons).ewm(alpha=.01).mean())
plt.show()
plt.title("EWMA of consumption, alpha = .005")
plt.xlabel("Day")
plt.ylabel("EWMA(cons)")
plt.plot(pd.Series(cons).ewm(alpha=.005).mean())
plt.show()
case 1 blog 4/1/21, 11:11 PM
case 1 blog 4/1/21, 11:11 PM
Correlations for n-day periods, n ∈ [1, 50)
Again, we fail to observe any interesting patterns
In [36]: for i in range(1, 50):

corr = np.corrcoef(cons[i:], cons[:-i])[0][1]
if abs(corr) > .03:
print(i, corr)
10 0.045732008518092615
14 -0.03654910372937772
19 -0.03024981752334226
24 -0.03749542551664027
43 0.03193238383270123
47 -0.031447572727001556
Putting it all together

Now that we're satisfied with modelling consumption as an exponential distribution, we can predict the price
of the futures contracts. We begin by looking at past contract prices.
In [67]: print("\t".join(["Year", "June", "Aug", "Oct", "Dec"]))

for year in data:
days = np.array([180, 242, 302, 364])
if int(year.loc[0].date[:4]) % 4 == 0: # Leap year
days += 1
print(year.loc[0].date[:4], *np.round(year.loc[days].p.values, 2),
sep="\t")
Year June Aug Oct Dec

2010 3.0 3.35 3.81 4.38
2011 3.0 3.12 3.58 4.01
2012 3.0 3.15 3.64 4.21
2013 3.0 3.2 3.72 4.23
2014 3.0 3.32 3.85 4.25
2015 3.0 3.05 3.44 3.8
2016 2.0 3.0 3.48 3.95
2017 3.0 3.09 3.55 3.96
2018 2.0 3.09 3.44 3.91
2019 3.0 3.15 3.54 4.02
case 1 blog 4/1/21, 11:11 PM
August, October and December contracts
We'll ignore the June contract for now (it's a little more mathematically complicated to price).
Historically, these 3 contracts expire only after natural gas capacity is used up, and therefore their prices
would be $3.00 + $.001 * amount of electricity usage exceeding coal capacity. Let N be the number of days
until the contract expires, G be the expected natural gas capacity (KWh), and C be the expected coal
capacity (KWh), E be the amount of electricity already consumed (KWh).
Then, the expected value of the contract is 3.00 + 𝑚𝑎𝑥(0, .001 × ( 1 × 𝑁 + 𝐸 − 𝐺 − 𝐶)), where 𝜆 =
𝜆
0.1252364948706893 was computed earlier. Here's our predictive formula in action for the December
contract in 2015.
In [201]: G = gasreg.predict(np.full((1,1), 5))[0]

C = coalreg.predict(np.full((1,1), 5))[0]
print(f'G = {G:.2f}, C = {C:.2f}')
data[5]["pred"] = 3.00 + .001 * (data[5].cons + 1/lmb * np.arange(364,
-1, -1) - G - C)
plt.ylim(3.1, 4.3)
plt.plot(data[5].pred, label="Predicted price")
plt.axhline(3.80, c="red", label="True price")
plt.plot(np.arange(75, 365), pd.Series(data[5].dailycons).ewm(alpha=.0
01).mean()[75:] / 2.3, label="EWMA(Consumption), scaled")
plt.legend(loc="lower left")
plt.xlabel("Day in 2015")
plt.ylabel("Price ($)")
plt.show()
case 1 blog 4/1/21, 11:11 PM
G = 1364.43, C = 477.29
As the diagram shows, the model overestimates the contract price initially, but approached the true value
closer to the contract expiry date. This is mainly because consumption decreased significantly over the later
half of the year, bringing down the final electricity price. This seemed good enough for our purposes.
A small improvement in G and C predictions
We used the mean value of the regression prediction for gas and coal capacity in the previous example, but
it could result in ridiculous situations where the predicted capacity is lower than the capacity we have
already observed. If we predicted 1300KWh of gas capacity but have already observed 1320KWh of gas
being used, then we should adjust our expectations for the amount of total gas capacity.
Since we assumed a normal distribution for gas capacity, N(𝜇, 𝜎 2 ), we can change it to a truncated normal
distribution (https://en.wikipedia.org/wiki/Truncated_normal_distribution) by cutting off the impossible lower
tail. This is basically combining a normal prior with an uniform posterior starting from the observed capacity.
case 1 blog 4/1/21, 11:11 PM
The June contract
The June contract is different from the other contracts. Historically, it either settled to $2 or $3, depending on
whether natural gas capacity was exhausted by expiry, with 2 of 10 cases expiring at $2. Our previous
approach would not be able to accurately price the June contract at all!
There are 2 ways around this: using the Central Limit Theorem (CLT) to assume normality in the distribution
of consumption, or to use the Erlang distribution to compute the probability exactly using marginalization
and integration. We went with the latter, but I'll show both methods and the differences in their predictions.
Assuming normality using CLT
From 10 years of consumption data, we have mean consumption of 7.98 and variance of 66.95. If there are
N days till the expiry of the June contract (and N sufficiently large), and X is the amount of electricity
consumed during the N days, then by the CLT X ~ N(7.98 N, 66.95 N). On the other hand, remaining gas
capacity G ~ N(𝜇, 𝜎 2 ).
G - X ~ N(𝜇 - 7.98N, 𝜎 2 + 66.95N)
Probability of settling at $3, P = P(G-X < 0) = norm.cdf(0, loc=𝜇 - 7.98N, scale=𝜎 2 + 66.95N)
Fair price of June contract = $2 + P
We demonstrate this with 2016 data, which interestingly settled at $2.
In [208]: from scipy.stats import norm
case 1 blog 4/1/21, 11:11 PM

mu = G - data[6].cons[:181] - 7.98 * np.arange(181, 0, -1)
std = np.sqrt(46 + 66.95 * np.arange(181, 0, -1))
P = norm.cdf(0, loc=mu, scale=std)
price = 2 + P
plt.plot(price, label="Predicted price")

plt.ylim(2, 3)
plt.ylabel("Price")
plt.show()
case 1 blog 4/1/21, 11:11 PM
On the Erlang Distribution (https://en.wikipedia.org/wiki/Erlang_distribution)
Suppose there are N days until the contract expiry. The amount of electricity used in one day ~ Exp(𝜆 ).
What's the distribution of the amount of electricity, E, used in N days?
It turns out that E is an Erlang distribution with k = N and rate = 𝜆 . If you want to learn more about the
derivation, you can read this resource (https://towardsdatascience.com/sum-of-exponential-random-
variables-b023b61f0c0f).
Here's what the distribution looks like for various values of N, compared to the normal distribution. For large
values for N, the normal distribution closely approximates the Erlang distribution, but at smaller values of N
the difference diverges significantly, motivating us to compute the probabilities more accurately.
In [253]: for n in [1, 5, 20, 50]:

plt.title(f"N = {n}")
x = np.linspace(-n/lmb, n / lmb * 3, 100)
plt.xlabel("x")
plt.ylabel("P(X=x)")
plt.plot(x, gamma.pdf(x, n, scale=1/lmb), label=f"Erlang(n, {1/lmb
:.2f})")
plt.plot(x, norm.pdf(x, loc=n/lmb, scale=np.sqrt(n)/lmb), label=f"
N({n/lmb:.2f}, {np.sqrt(n)/lmb:.2f})")
plt.show()
case 1 blog 4/1/21, 11:11 PM
case 1 blog 4/1/21, 11:11 PM
case 1 blog 4/1/21, 11:11 PM
Direct computation using the Erlang Distribution
Once again, we model the remaining gas G ~ N(𝜇, 𝜎 2 ).
We want to know P = P(G < E) = P(G - E < 0).
P(G - E < 0) = ∫𝐺 𝑃 (𝑔 − 𝐸 < 0)𝑃 (𝐺 = 𝑔)𝑑𝐺
While this integral looks daunting to compute mathematically, we can simply compute it numerically to
enough precision. We reduce the infinite range (-inf, inf) to (𝜇 - 5𝜎 , 𝜇 + 5𝜎 ), and take k = 10000 evenly
spaced points in this range. With this, our integral becomes a summation which we can compute via a dot
product.
𝜇+5𝜎
P(G - E < 0) = ∑𝜇−5𝜎 𝑃 (𝑔 − 𝐸 < 0)𝑃 (𝐺 = 𝑔)
In [242]: from scipy.stats import gamma

def prob_is_gas2(gas_left, n_days):
gas_var = 66.95
n_pts = 10000
x_range = np.linspace(-5*(gas_var**0.5), min(5*(gas_var**0.5), gas
_left), n_pts)
nms = norm.pdf(x_range, scale=gas_var**0.5)
gams = gamma.cdf(gas_left - x_range, n_days, scale=1/lmb)
ans = np.dot(gams,nms)
ans /= np.sum(nms)
return ans
direct_pred = 3 - data[6][:181].apply(lambda row: prob_is_gas2(G - row

.cons, 181 - row.name), axis=1)
plt.plot(price, label="CLT pred")

plt.plot(direct_pred, label="Direct pred")
plt.ylim(2, 3)
plt.ylabel("Price")
plt.show()
case 1 blog 4/1/21, 11:11 PM
The CLT prediction seems to be good enough for most purposes. Zooming into the last few days, we see
that the prices diverge by a few cents, still significant but in our competition's context isn't going to make a
big difference in profitability.
case 1 blog 4/1/21, 11:11 PM
In [278]: plt.plot(price, label="Normal pred")

plt.plot(direct_pred, label="Direct pred")
plt.ylim(2, 3)
plt.xlim(160, 182)
plt.ylabel("Price")
plt.show()
Fairs = computed
Before the competition, we put a lot of effort into analyzing historical data and computing our fairs, so we
thought that a lot of our success came from these good fairs. Looking back, these fairs may have put us
above the poor or middle performing teams, but didn't differentiate us from the other top teams. Read on for
what really mattered!

Case 1 Blog

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Case 1 Blog

Uploaded by

Copyright:

Available Formats

case 1 blog 4/1/21, 11:11 PM

What's our fair?

In [1]: import numpy as np

In [2]: plt.rcParams['figure.figsize'] = (8.0, 6.0)

In [3]: !ls case1/DATA

2010.csv 2012.csv 2014.csv 2016.csv 2018.csv

In [4]: data = [pd.read_csv(f'case1/DATA/{year}.csv', header=0, names=['date',

0 2010-01-01 0.000000 0.0

1 2010-01-02 0.447285 2.0

2 2010-01-03 14.339777 2.0

3 2010-01-04 17.388527 2.0

4 2010-01-05 28.372350 2.0

The 4 contract expiry dates are plotted as vertical red lines.

In [8]: plt.title("Daily consumption in 2010")

In [9]: # Convert cumulative consumption to daily

Gas and coal supplies

We will investigate the pattern of gas/coal consumption over the 10 years.

In [12]: from sklearn.linear_model import LinearRegression

In [90]: gasreg = LinearRegression().fit(np.arange(10).reshape(-1, 1), gas)

In [91]: coalreg = LinearRegression().fit(np.arange(10).reshape(-1, 1), coal)

What's the underlying distribution of consumption?

In [15]: # Combining consumption over 10 years

In [17]: plt.title("Histogram of consumption")

Hmm, does this look like an exponential distribution?

The exponential distribution has PDF of 𝜆𝑒−𝜆𝑥 , with mean of 1𝜆

In [18]: lmb = 1 / np.mean(cons)

In [19]: plt.title("Histogram of consumption")

In [60]: print("Sample variance:", np.var(cons))

Sample variance: 66.954336216657

Autocorrlelation between consecutive days.

The low correlation suggests lack of relationship between consecutive days.

In [20]: np.corrcoef(cons[1:], cons[:-1])

Out[20]: array([[ 1. , -0.02293147],

Exponentially weighted moving averages

There doesn't seem to be a clear periodic trend in consumptions

In [31]: plt.title("EWMA of consumption, alpha = .01")

Correlations for n-day periods, n ∈ [1, 50)

Again, we fail to observe any interesting patterns

In [36]: for i in range(1, 50):

Putting it all together

In [67]: print("\t".join(["Year", "June", "Aug", "Oct", "Dec"]))

Year June Aug Oct Dec

August, October and December contracts

In [201]: G = gasreg.predict(np.full((1,1), 5))[0]

A small improvement in G and C predictions

The June contract

Assuming normality using CLT

G - X ~ N(𝜇 - 7.98N, 𝜎 2 + 66.95N)

Fair price of June contract = $2 + P

We demonstrate this with 2016 data, which interestingly settled at $2.

In [208]: from scipy.stats import norm

In [272]: G = gasreg.predict(np.full((1,1), 6))[0]

plt.plot(price, label="Predicted price")

On the Erlang Distribution (https://en.wikipedia.org/wiki/Erlang_distribution)

In [253]: for n in [1, 5, 20, 50]:

Direct computation using the Erlang Distribution

Once again, we model the remaining gas G ~ N(𝜇, 𝜎 2 ).

We want to know P = P(G < E) = P(G - E < 0).

P(G - E < 0) = ∫𝐺 𝑃 (𝑔 − 𝐸 < 0)𝑃 (𝐺 = 𝑔)𝑑𝐺

In [242]: from scipy.stats import gamma

In [279]: G = gasreg.predict(np.full((1,1), 6))[0]

direct_pred = 3 - data[6][:181].apply(lambda row: prob_is_gas2(G - row