Published in Towards Data Science


Dr. Robert Kübler


Bayesian Hierarchical Marketing Mix Modeling in PyMC

Learn how to build MMMs for different countries the right way

Photo by Annie Spratt on Unsplash

Imagine that you work in a marketing department and want to investigate
if your advertising investments paid off. People who read my other articles
know a way to do this: marketing mix modeling! This method is especially
interesting because it also works in a cookieless world.

Introduction to Marketing Mix Modeling in Python
Which advertising spendings are really driving your sales?
towardsdatascience.com

I have also covered Bayesian marketing mix modeling, a way to get more
robust models and uncertainty estimates for everything you forecast.
Bayesian Marketing Mix Modeling in Python via PyMC3
Estimate the saturation, carryover, and other parameters all at once, including their uncertainty
towardsdatascience.com

The Bayesian approach works exceptionally well for homogeneous data,
meaning that the effects of your advertising spending are comparable across
your dataset. But what happens when we have a heterogeneous dataset, for
example, spending across several countries? Two obvious ways to deal with
it are the following:

1. Ignore the fact that there are several countries in the dataset and build a
single big model.

2. Build one model per country.

Unfortunately, both methods have their disadvantages. Long story short:
ignoring the country leaves you with a model that is too coarse and will
probably underfit. On the other hand, if you build one model per country, you
might end up with too many models to keep track of. Even worse, if some
countries don’t have many data points, the models for those countries might
overfit.

Usually, it is more efficient to create a hybrid model that sits somewhere
between these two approaches: Bayesian hierarchical modeling! You can read
more about it here as well:

Bayesian Hierarchical Modeling in PyMC3
Curing your model's amnesia
towardsdatascience.com
How exactly can we benefit from this in our case, you ask? As an example,
Bayesian hierarchical modeling could produce a model where the TV carryover
values in neighboring countries are not too far apart from each other, which
counters overfitting.

However, if the data clearly suggests that parameters are in fact completely
different, the Bayesian hierarchical model will be able to pick this up as well,
given enough data.
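
To make this pulling-together effect a bit more tangible, here is a minimal sketch of partial pooling. Note that it uses a simplified normal-normal model with known variances, which is not the model we build below, and that the function name and all numbers are made up for illustration. The estimate for one country is a weighted average of what its own data says and what the group of countries says, and the weight of the own data grows with the number of observations.

```python
def partial_pool(country_mean, n_obs, noise_var, group_mean, group_var):
    """Posterior mean of a country-level parameter in a normal-normal model.

    Assumption: noise_var and group_var are known, which is a simplification.
    """
    data_precision = n_obs / noise_var   # grows with the number of observations
    prior_precision = 1.0 / group_var    # how tightly the countries cluster
    weight = data_precision / (data_precision + prior_precision)
    return weight * country_mean + (1.0 - weight) * group_mean


# Hypothetical numbers: the countries average 0.5 as a group,
# while the data of one single country suggests 0.9.
few = partial_pool(0.9, n_obs=5, noise_var=1.0, group_mean=0.5, group_var=0.05)
many = partial_pool(0.9, n_obs=500, noise_var=1.0, group_mean=0.5, group_var=0.05)

print(f"few observations:  {few:.3f}")
print(f"many observations: {many:.3f}")
```

With only five observations, the estimate stays close to the group value. With 500 observations, the country's own data dominates, which is the "given enough data" part from above.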

In the following, I will show you how to combine Bayesian marketing mix
modeling (BMMM) with Bayesian hierarchical modeling (BHM) to create (maybe
you guessed it) a Bayesian hierarchical marketing mix model (BHMMM) in
Python using PyMC.

BHMMM = BMMM + BHM
Researchers from Google have also written a paper about this idea that I
encourage you to check out later as well [1]. You should be able to
understand this paper quite well after you have understood my articles about
BMMM and BHM.

Note that I do not use PyMC3 anymore but PyMC, which is a facelift of this
great library. Fortunately, if you knew PyMC3 before, you will be able to pick
up PyMC as well. Let’s get started!

Preparations
First, we will load a synthetic dataset that I made up myself, which is fine for
training purposes.
import pandas as pd

dataset_link = "https://raw.githubusercontent.com/Garve/datasets/fdb81840fb96faeda5a874efa1b9bbfb83ce1929/bhmmm.csv"

data = pd.read_csv(dataset_link)

# Features (including the Country column) and the target
X = data.drop(columns=["Sales", "Date"])
y = data["Sales"]

The data, created by the author. Image by the author.

Now, let me copy over some functions from my other article, one for
computing exponential saturation and one for dealing with carryovers. I
adjusted them to work with the new PyMC, i.e. changed theano.tensor to
aesara.tensor and tt to at.
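import aesara.tensor as at

def saturate(x, a):
    # Exponential saturation with diminishing returns
    return 1 - at.exp(-a*x)

def carryover(x, strength, length=21):
    # Geometrically decaying weights: strength**0, strength**1, ...
    w = at.as_tensor_variable(
        [at.power(strength, i) for i in range(length)]
    )

    # Row i holds x shifted right by i steps, padded with zeros at the start
    x_lags = at.stack(
        [at.concatenate([
            at.zeros(i),
            x[:x.shape[0]-i]
        ]) for i in range(length)]
    )

    # Weighted sum over all lags for every time step
    return at.dot(w, x_lags)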
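
If you want to see what these two functions compute without building a tensor graph, the following plain-Python sketch may help. The names saturate_py and carryover_py are made up for this illustration and are not used anywhere else in the article. They are meant to mirror the formulas above on ordinary lists.

```python
import math


def saturate_py(x, a):
    # Exponential saturation, element by element: 1 - exp(-a * x)
    return [1.0 - math.exp(-a * value) for value in x]


def carryover_py(x, strength, length=21):
    # out[t] = sum over lags i of strength**i * x[t - i],
    # where lags that reach before the start of the series are skipped
    return [
        sum(strength**i * x[t - i] for i in range(min(length, t + 1)))
        for t in range(len(x))
    ]


# A single burst of spend that decays by half each period, over 3 periods
spend = [100.0, 0.0, 0.0, 0.0]
carried = carryover_py(spend, strength=0.5, length=3)
print(carried)  # [100.0, 50.0, 25.0, 0.0]

saturated = saturate_py(carried, a=0.01)
print([round(value, 3) for value in saturated])
```

The burst of 100 is still felt with 50 and 25 in the next two periods and then drops out, because the carryover length is 3 here. The saturation then squeezes these values into the range between 0 and 1.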

We can start modeling now.

BHMMM Building
Before we start with the full model, we could start by building separate
models, just to see what happens and to have a kind of baseline.

Separate Models
If we follow the methodology from here, we get for Germany:
Model prediction for Germany. Image by the author.

A quite nice fit. However, for Switzerland we only have 20 observations, so the
predictions are not too great:

Model prediction for Switzerland. Image by the author.

This is exactly the reason why the separate models approach is sometimes
problematic. There is reason to believe that the people in Switzerland are not
completely different from the people in Germany regarding the impact of
media on them, and a model should be able to capture this.

We can also see what the Switzerland model has learned about the
parameters:

Posteriors for the Switzerland model (coef, sat, and car for TV, Radio, and Banners, plus base and noise, each with its 94% HDI). Image by the author.

The posteriors are still quite wide due to the lack of Switzerland data points.
You can see this from the car_ parameters on the right: the 94% HDI of the
carryovers nearly spans across the entire possible range between 0 and 1.
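
In case the term is new: the 94% HDI is the narrowest interval that contains 94% of the posterior samples. The following is a rough sketch of that idea in plain Python, not the routine of any plotting library. The function hdi and both sample lists are made up for illustration.

```python
def hdi(samples, prob=0.94):
    """Narrowest interval that contains a share `prob` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    n_inside = max(1, int(round(prob * n)))  # samples inside the interval
    # Slide a window of n_inside samples and keep the narrowest one
    best_start = min(
        range(n - n_inside + 1),
        key=lambda start: ordered[start + n_inside - 1] - ordered[start],
    )
    return ordered[best_start], ordered[best_start + n_inside - 1]


# Hypothetical draws: one spread over (0, 1), one concentrated around 0.5
wide = [i / 1000 for i in range(1, 1000)]
narrow = [0.5 + (i - 500) / 10000 for i in range(1, 1000)]

print("wide:  ", hdi(wide))
print("narrow:", hdi(narrow))
```

A carryover posterior like the first one tells you close to nothing, since almost every value between 0 and 1 remains plausible. That is the situation we are in for Switzerland.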

Let us build a proper BHMMM now, so especially Switzerland can benefit from
the larger amount of data that we have from Germany and Austria.

PyMC Implementation
We introduce some hyperpriors that shape the underlying distribution over
all countries. For example, the carryover is modeled using a Beta distribution.
This distribution has two parameters α and β, and we reserve two hyperpriors
car_alpha and car_beta to model these.
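
To get a feeling for what α and β do, here is a small sketch that only uses the textbook formulas for the mean and variance of a Beta distribution. The parameter pairs are arbitrary examples, not values learned by any model in this article.

```python
def beta_mean_and_sd(alpha, beta):
    # Standard formulas for the mean and standard deviation of Beta(alpha, beta)
    mean = alpha / (alpha + beta)
    variance = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, variance ** 0.5


# Arbitrary example values for the two parameters
for alpha, beta in [(1, 1), (5, 5), (50, 50), (80, 20)]:
    mean, sd = beta_mean_and_sd(alpha, beta)
    print(f"alpha={alpha:>2}, beta={beta:>2}: mean={mean:.2f}, sd={sd:.3f}")
```

The ratio of α and β sets where the carryovers are centered, and the larger both values are, the narrower the distribution gets. If the data suggests that the carryovers of all countries are similar, the hyperpriors can move towards large values and pull the country-level carryovers together. If not, small values leave each country a lot of freedom.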
In the line that defines car, you can see how the hyperpriors are then used to
define the carryover per country and channel. Furthermore, I use more tuning
steps than usual (3000 instead of 1000) because the model is quite complex.
Having more tuning steps gives the model an easier time inferring.

import pymc as pm

with pm.Model() as bhmmm:
    # Hyperpriors, shared by all countries
    coef_lam = pm.Exponential("coef_lam", lam=10)
    sat_lam = pm.Exponential("sat_lam", lam=10)
    car_alpha = pm.Exponential("car_alpha", lam=0.01)
    car_beta = pm.Exponential("car_beta", lam=0.01)
    base_lam = pm.Exponential("base_lam", lam=10)

    # For each country
    for country in X["Country"].unique():
        X_ = X[X["Country"] == country]
        channel_contributions = []

        # For each channel, like in the case without hierarchies
        for channel in ["TV", "Radio", "Banners"]:
            coef = pm.Exponential(f"coef_{channel}_{country}", lam=coef_lam)
            sat = pm.Exponential(f"sat_{channel}_{country}", lam=sat_lam)
            car = pm.Beta(f"car_{channel}_{country}", alpha=car_alpha, beta=car_beta)

            channel_data = X_[channel].values
            channel_contribution = pm.Deterministic(
                f"contribution_{channel}_{country}",
                coef * saturate(carryover(channel_data, car), sat),
            )

            channel_contributions.append(channel_contribution)

        base = pm.Exponential(f"base_{country}", lam=base_lam)
        noise = pm.Exponential(f"noise_{country}", lam=0.001)

        sales = pm.Normal(
            f"sales_{country}",
            mu=sum(channel_contributions) + base,
            sigma=noise,
            observed=y[X_.index].values,
        )

    trace = pm.sample(tune=3000)

And that’s it!

Checking the Output
Let us only take a look at how well the model captures the data.
BHMMM prediction results of Germany and Austria. Image by the author.

I will not conduct any real checks with metrics now, but from the plots, we can
see that the fit for Germany and Austria looks quite good.
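
If you want to back up the visual impression with numbers, a per-country check could look roughly like the following sketch. The two metrics are standard, but the function names and all numbers below are made up. In practice, you would plug in the observed sales and, for instance, the posterior mean predictions of one country.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes no true value is zero."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical sales and predictions for a single country
actual = [100.0, 120.0, 90.0, 110.0]
predicted = [105.0, 115.0, 95.0, 108.0]

print(f"MAPE: {mape(actual, predicted):.3f}")
print(f"R^2:  {r_squared(actual, predicted):.3f}")
```

Computing such numbers per country, ideally on data the model has not seen, would show whether countries with few observations really benefit from the hierarchy.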

If we compare Switzerland from the BHMMM to the version of the BMMM
from before, we can also see that it looks so much better now.

BHMMM (left) and BMMM (right) prediction results of Switzerland. Image by the author.

This is only possible because we have given Switzerland some context using
the data of other, similar countries.

We can also see how the posteriors of the carryovers of Switzerland narrowed
down:
Image by the author.

Some distributions are still a bit wild, and we would have to take a deeper look
into how to fix this. There might be sampling issues or the priors might be bad,
among other things. However, we will not do that here.

Conclusion
In this article, we have taken a quick look at two different Bayesian concepts:

Bayesian marketing mix modeling to analyze marketing spendings, as well as

Bayesian hierarchical modeling.

We then forged an even better Bayesian marketing mix model by combining
both approaches. This is especially handy if you deal with some hierarchy, for
example, when building a marketing mix model for several related countries.

This method works so well because it gives the model context: if you tell the
model to give a forecast for one country, it can take the information about
other countries into account. This is crucial if the model has to operate on a
dataset that is otherwise too small.

Another thought that I want to leave you with is the following: in this
article, we used a country hierarchy. However, you can think of other
hierarchies as well, for example, a channel hierarchy. A channel hierarchy
can arise if you say that different channels should not behave too differently,
for example, if your model takes not just banner spending as a whole but
banner spending on website A and banner spending on website B, where the
user behavior on websites A and B is not too different.

References
[1] Y. Sun, Y. Wang, Y. Jin, D. Chan, J. Koehler, Geo-level Bayesian
Hierarchical Media Mix Modeling (2017)

I hope that you learned something new, interesting, and useful today. Thanks
for reading!

