
August 4, 2016

1 Introduction

The purpose of this paper is to survey recent developments in a literature that combines ideas from

experimentation, learning, and strategic interactions. Because this literature is multifaceted, let

us start by circumscribing our overview. First and foremost, all surveyed papers involve nontrivial

dynamics. Second, we will restrict attention to models that deal with uncertainty. Models of pure

moral hazard, in particular, will not be covered. Third, we exclude papers that focus on monetary

transfers. Our goal is to understand incentives via other channels: information in particular, but

also delegation. Fourth, we focus on strategic and agency problems, and so leave out papers whose

scope is decision-theoretic. However, rules are there to be broken, and we will briefly discuss some

papers that deal with one-player problems, to the extent that they are closely related to the issues

at hand. Finally, we restrict attention to papers that are relatively recent (specifically, we have

chosen to start with Bolton and Harris, 1999).

Our survey is divided as follows. First, we start with models of strategic experimentation.

These are abstract models with few direct economic applications, but they develop ideas and

techniques that percolate through the literature. In these models, players are (usually) symmetric

and externalities are (mostly) informational.

Moving beyond the exploitation/exploration trade-off, we then turn to agency models that

introduce a third dimension: motivation. Experimentation must be incentivized. The first way this

can be done (Section 3) is via the information that is being disclosed to the agent performing the

experimentation, by a principal who knows more or sees more. A second way this can be done is

via control. The nascent literature on delegation in dynamic environments is the subject of Section

4.

Section 5 turns to models in which information disclosure is not simply about inducing experimentation, but about manipulating the agent's action in broader contexts. To abstract from experimentation altogether, we assume that the principal knows all there is to know, so that only the agent

faces uncertainty.

Finally, Section 6 discusses experimentation with more than two arms (Callander, 2011).

∗ This survey was prepared for the Econometric Summer Meetings in Montréal, 2015. We thank Kostas Bimpikis,

Alessandro Bonatti, Daria Khromenkova, Nicolas Klein, Erik Madsen and Chiara Margaria for detailed comments.


2 Equilibrium Interactions

2.1 Strategic Bandits

Strategic bandit models are game-theoretic versions of standard bandit models. While the standard “multi-armed bandit” describes a hypothetical experiment in which a player faces several slot machines (“one-armed bandits”) with potentially different expected payouts, a strategic bandit involves several players facing (usually, identical) copies of the same slot machine. Players want to

stick with the slot machine if and only if the best payout rate makes it worth their time, and learn

not only from their own outcomes but also from their neighbors.

Equilibrium strategies are not characterized by simple cut-offs in terms of the common belief. As

a result, solutions to strategic bandits are only known for a limited class of distributions, involving

two states of the world only. In Bolton and Harris (1999, BH), the observation process (of payoffs)

follows a Brownian motion, whose drift depends on the state. In Keller, Rady and Cripps (2005,

KRC), it follows a simple Poisson process, with positive lump-sums (“breakthroughs”) occurring

at random (exponentially distributed) times if and only if the arm is good. Keller and Rady (2015,

KR15) solve the polar opposite case in which costly lump-sums (“breakdowns”) occur at random

times if and only if the arm is bad. Keller and Rady (2010, KR10) consider the case in which

breakthroughs need not be conclusive.

These models share, in addition to the binary state framework, a focus on symmetric Markov perfect equilibria (MPE).1 Throughout, players are Bayesian. Given that they

observe all actions and outcomes, they share a common belief about the state, which serves as the

state variable. They are also impatient and share a common discount rate.

BH and KR10 are the most ambitious models and offer no closed-form solutions for the equilibrium. Remarkably, however, they are able to prove its uniqueness and tease out not only its structure, but also its dependence on parameters. While it is BH that first develops both the ideas (including the concepts of free-riding and encouragement effects) and the methods used throughout this literature, the most interesting insights can already be gleaned from the simple exponential bandits in KRC and KR15. What makes these two models tractable is that they can be viewed

as deterministic: unless a breakdown or breakthrough (“news”) occurs, the (conditional) posterior

belief follows a known path. If news arrives, the game is over, since if it is commonly known that

the state is good (or bad), informational externalities cease to matter, and each player knows the

strictly dominant action to take.

Let us consider here a simplified version combining the good and bad news models. The state is

ω ∈ {G, B}. Each player i = 1, . . . , I controls the variable uit ∈ [0, 1], which is the fraction allocated

to the risky arm at time t ≥ 0 (the complementary fraction being allocated to the safe arm). The

horizon is infinite. This leads to a total realized payoff of

∫0∞ e−rt (h dNiG,t − ℓ dNiB,t ),

where r > 0 is the discount rate, h, ℓ > 0 are the value and cost of a breakthrough or breakdown, and NiG,t , NiB,t are Poisson processes with intensities uit λG 1{ω = G} and uit λB 1{ω = B}, respectively, which are conditionally independent across players. Here, λG , λB ≥ 0 are parameters. News is

conclusive: if any player experiences a breakthrough (breakdown), all players immediately learn

1 KRC and KR15 discuss asymmetric MPE as well, although this remains an open and challenging problem for

the Brownian case.


that the state is good (bad) and allocate weight 1 (0) to the risky arm, the alternative having value

normalized to 0.
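As a quick sanity check on this setup, the value of the risky arm in the known-good state is λG h/r. The Monte Carlo sketch below is ours, not from the surveyed papers; the function name and parameter values are illustrative only.

```python
import math
import random

def risky_arm_value_draw(state_good, lam_G, lam_B, h, ell, r, rng):
    """One draw of the discounted payoff from playing the risky arm (u = 1)
    until the first piece of news. News is conclusive: a breakthrough (good
    state) is worth h plus the continuation value lam_G * h / r of playing
    risky forever after; a breakdown (bad state) costs ell, after which the
    safe arm (value normalized to 0) is played."""
    if state_good:
        tau = rng.expovariate(lam_G)   # first breakthrough time
        return math.exp(-r * tau) * (h + lam_G * h / r)
    tau = rng.expovariate(lam_B)       # first breakdown time
    return -math.exp(-r * tau) * ell

rng = random.Random(0)
draws = [risky_arm_value_draw(True, 1.0, 1.0, 1.0, 1.0, 1.0, rng)
         for _ in range(20000)]
est = sum(draws) / len(draws)
# In the good state with lam_G = h = r = 1, the value is lam_G * h / r = 1
assert abs(est - 1.0) < 0.05
```

The check works because E[e−rτ](h + λG h/r) with τ exponential of rate λG equals λG h/r exactly.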

The main distinction in these models hinges on the direction in which updating occurs in the

absence of any news. The (conditional) belief pt = P[ω = G] that the state is good evolves according

to

ṗt = −∆pt (1 − pt ) Σi=1,...,I uit , where ∆ := λG − λB .

Hence, if λG > λB , then this belief drifts down over time. We call such a model a good news

model (because a special case is λB = 0, in which case the only news that might arrive is good). If

λG < λB , the belief drifts up, unless news arrives. This is the bad news model.2
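These drift directions are easy to verify numerically. Below is a minimal Euler integration of the belief ODE above; the parameter values are illustrative only.

```python
# Euler integration of the no-news belief dynamics
# dp/dt = -Delta * p * (1 - p) * U, with Delta = lam_G - lam_B and
# U the total experimentation intensity across players.
def belief_path(p0, lam_G, lam_B, total_u, T=5.0, dt=1e-3):
    delta = lam_G - lam_B
    p, path = p0, [p0]
    for _ in range(int(T / dt)):
        p += -delta * p * (1 - p) * total_u * dt
        path.append(p)
    return path

good = belief_path(0.5, lam_G=1.0, lam_B=0.0, total_u=2.0)  # good news
bad = belief_path(0.5, lam_G=0.0, lam_B=1.0, total_u=2.0)   # bad news
# Absent news, the belief drifts down under good news, up under bad news
assert good[-1] < 0.5 < bad[-1]
```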

The first best, or cooperative, outcome is simple. Players should follow a common cut-off

(Markov) policy, using the risky arm if and only if the belief pt is above some threshold (say, p̄I in

the bad news case, and pI in the good news one). Such behavior results in very different dynamics

according to the news scenario. In the bad news case, experimentation goes on forever unless (bad)

news occurs. In the good news model, experimentation stops unless (good) news occurs. This

drastically changes equilibrium predictions.

To understand incentives, let us fix players −i’s strategies to what they would do, each on their own (that is, the optimal single-agent policy). Let us start with the bad news case, in which this involves a threshold p̄1 above which they experiment (set ui = 1). This threshold

is below the myopic threshold pm at which the expected reward from taking the risky action is

precisely 0, namely,

λG pm h = λB (1 − pm )ℓ.

This is because the option value from learning via experimentation makes it worthwhile for a lone

player to use the risky arm at beliefs slightly below pm , when the flow loss from doing so is small

enough.
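Solving the indifference condition above for pm gives pm = λB ℓ/(λG h + λB ℓ). A one-function sketch (illustrative values; the function name is ours):

```python
def myopic_threshold(lam_G, lam_B, h, ell):
    """Belief at which the expected flow payoff of the risky arm is zero:
    lam_G * p * h = lam_B * (1 - p) * ell."""
    return lam_B * ell / (lam_G * h + lam_B * ell)

pm = myopic_threshold(lam_G=1.0, lam_B=1.0, h=2.0, ell=1.0)
# The expected flow payoff pi(pm) = pm*lam_G*h - (1-pm)*lam_B*ell vanishes
assert abs(pm * 1.0 * 2.0 - (1 - pm) * 1.0 * 1.0) < 1e-12
```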

Consider now the best-reply of a player to such a strategy by the others at beliefs in the neighborhood of this threshold p̄1 . Right above p̄1 , player i need no longer carry out costly experimentation on his own to learn about the state, because other players are experimenting. Hence, there is a range of beliefs right above p̄1 in which his best-reply is to use the safe arm only, depressing

overall experimentation at such beliefs. This is the free-riding effect. See Figure 1. As a result,

player i’s payoff is boundedly higher than what it would be if he were on his own, given that the

value of information is strictly positive.

Consider now beliefs right below p̄1 . If player i does not experiment, nothing will happen, given

that nobody else does. Yet, getting the belief “over the edge” p̄1 is very valuable, as it will kickstart

experimentation by players −i. Because player i would be indifferent at p̄1 if he were on his own,

without the added benefit provided by others’ experimentation, he strictly prefers to do so if this is

the prerequisite to get others started. Hence, there is a range of beliefs below p̄1 at which player i’s

best-reply is to experiment. This is the encouragement effect. Absent free-riding, the encouragement effect is responsible for the cooperative amount of experimentation increasing with the number of players (e.g., p̄I decreasing in I). Of course, for very low beliefs, playing safe

is the best-reply.

Hence, the best-reply to the optimal (single-player) cut-off policy is not a cut-off policy. In

fact, it is not monotone. With some work, however, it can be shown that a symmetric monotone

(pure-strategy) equilibrium exists, but it involves interior levels of experimentation (ui ∈ (0, 1)) for

some beliefs.

2 The special case λG = λB can be nested in either case.


[Figure 1 omitted: time path of the belief pt , with region “Play R” above p̄1 , “Play R: Encouragement Effect” in a band just below p̄1 , and “Play S” below that band.]

Figure 1: Best-reply to the one-player optimal policy under bad news.

In contrast, consider now the case of good news. Figure 2 illustrates the best-reply to the one-player optimal policy, which specifies play of the risky arm if and only if p ≥ p1 . Beliefs are now

partitioned into only three intervals. There is no encouragement region for beliefs below p1 . This is

because of the dynamics of beliefs. Here, experimentation at such beliefs does not move the common

belief closer to the region at which other players would experiment (unless a breakthrough occurs,

obviously). Because the conditional belief absent a breakthrough only drifts down further, a player

is on his own, for beliefs below p1 . Hence, by the definition of this threshold, it is optimal to play

safe. This does not imply that the one-player optimal policy is an equilibrium, however, because at

beliefs immediately above p1 , an agent still has incentives to free-ride on others’ experimentation.

Because p1 is strictly below the myopic threshold at which playing risky already pays off in the

next instant, playing risky at such beliefs is motivated by the option value from learning. Because

other players perform this experimentation, it is then best to play safe oneself, at least for beliefs

that are sufficiently close to (and above) p1 .

Hence, here as well, the symmetric equilibrium involves interior levels of experimentation. In

this sense, it looks similar to the symmetric equilibrium in the bad news case. But the absence of

the encouragement region for beliefs below p1 implies that experimentation levels are lower than the

socially efficient level, with experimentation ceasing altogether at this threshold independent of the

number of players. In particular, independent of I, the total amount of experimentation performed

by players over the infinite horizon (which the asymptotic belief precisely measures) is the same

as if there were only one player. Delay is higher, as one can show that, unlike with one player, this belief is only asymptotically achieved, as experimentation rates dwindle when the belief

approaches the threshold. Players are better off, of course (they can always ignore the existence of

other players, and replicate the one-player policy and payoff), because the cost of experimentation

is shared among them. Nonetheless, the outcome is inefficient.

We now provide some details on how these symmetric equilibria can be solved for. Consider

the Hamilton-Jacobi-Bellman (HJB) equation for player i’s continuation value v, taking as given the aggregate experimentation by other players, u−i (p) = Σj≠i uj (p). It holds that

rv(p) = maxui∈[0,1] { ui π(p) + (ui + u−i (p))[ pλG (λG h/r − v(p)) − (1 − p)λB v(p) − ∆p(1 − p)v ′ (p) ]},

where the first two terms in the square bracket give the jump in value if news arrives, and the last term the drift in value if no news arrives,

where

π(p) = pλG h − (1 − p)λB ℓ


[Figure 2 omitted: time path of the belief pt , with region “Play R” at high beliefs, “S: Free-Riding” in a band just above p1 , and “Play S” below p1 .]

Figure 2: Best-reply to the one-player optimal policy under good news.

is the expected flow payoff from the risky arm. Over the next instant, π(p) is collected by player i

at rate ui . Either one of the players experiences a breakthrough or breakdown, in which case the

value jumps from v(p) to either λG h/r if a breakthrough occurs, or 0 if it is a breakdown. In the

absence of news, the value changes at a rate proportional to the belief change ṗ. We see that the

right-hand side is linear in ui , implying that over a range of beliefs [p1 , p2 ] over which player i is

indifferent, the coefficient on ui must be zero, that is,

π(p) + pλG (λG h/r − v(p)) − (1 − p)λB v(p) − ∆p(1 − p)v ′ (p) = 0, (1)

a differential equation that can be solved explicitly, given the relevant boundary conditions. Adding

and subtracting u−i (p)π(p) from the HJB equation (and using (1) to eliminate almost all terms)

gives

rv(p) = −u−i (p)π(p), (2)

over such a range.3
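The step from the HJB equation to (2) uses only linearity in ui : once (1) sets the bracketed jump-plus-drift term to −π(p), the ui terms cancel. A dependency-free numerical check of this algebra (the numbers are arbitrary illustrative values):

```python
# At an indifference belief, (1) says the bracketed term equals -pi(p).
# Substituting into the HJB right-hand side,
#   u_i*pi + (u_i + u_other)*(-pi) = -u_other*pi,
# independently of u_i -- which is exactly equation (2).
pi_p, u_other = -0.4, 1.7        # arbitrary values (pi < 0 on such a range)
bracket = -pi_p                  # indifference condition (1): pi + bracket = 0
for u_i in (0.0, 0.3, 1.0):
    rhs = u_i * pi_p + (u_i + u_other) * bracket
    # Equation (2): r v(p) = -u_other * pi(p), whatever u_i is
    assert abs(rhs - (-u_other * pi_p)) < 1e-12
```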

In a symmetric equilibrium, u−i (p) = (I − 1)ui (p), and the optimal experimentation level

follows, as explained next. Equations (1)–(2) are equally valid for good and bad news. The

difference appears in the boundary conditions. In the case of bad news, the belief drifts up. It is

readily verified that ui increases in p, and so p2 is determined by ui (p2 ) = 1 and v(p2 ) = v̄(p2 ),

which together with (2) imply rv̄(p2 ) = −(I − 1)π(p2 ), where v̄ is the per-player value from the

cooperative solution discussed above. Once this threshold is reached, players achieve the first best.

Because of the upward drift, this suffices to determine the candidate value v(p) for all p < p2 .

Because playing safe is always an option, v(p) ≥ 0, and finding the solution to v(p1 ) = 0 then

determines p1 , below which indifference no longer holds and the safe arm is exclusively played.
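This bad-news recipe is straightforward to implement numerically. In the sketch below, the boundary point (p2, v2) is a hypothetical placeholder (in the model it comes from matching the cooperative value v̄), and all parameter values are illustrative; the point is only the mechanics of integrating (1) down to the belief where the value hits zero.

```python
# Bad news: lam_B > lam_G, so Delta < 0 and the no-news belief drifts up.
lam_G, lam_B, h, ell, r, I = 0.5, 1.5, 1.0, 1.0, 1.0, 2
delta = lam_G - lam_B

def pi(p):
    # Expected flow payoff of the risky arm
    return p * lam_G * h - (1 - p) * lam_B * ell

def v_prime(p, v):
    # Indifference condition (1), solved for v'(p)
    return (pi(p) + p * lam_G * (lam_G * h / r - v)
            - (1 - p) * lam_B * v) / (delta * p * (1 - p))

# Hypothetical upper boundary of the indifference range
p2, v2 = 0.6, 0.2
# Equation (2): r v(p) = -(I - 1) u_i(p) pi(p) pins down each player's level
ui_p2 = -r * v2 / ((I - 1) * pi(p2))

# Integrate (1) downward from p2; the belief where v hits 0 is p1
p, v, dp = p2, v2, 1e-4
while v > 0:
    v -= v_prime(p, v) * dp
    p -= dp
p1 = p

assert 0 < p1 < p2 and 0 < ui_p2 < 1
```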

3 Because v(p) ≥ 0, we see that such a range must be a subset of the beliefs over which π is negative. This should

not come as a surprise because above the myopic threshold players certainly use the risky arm.



Figure 3: Value function with bad news. Figure 4: Value function with good news.

In contrast, with good news, the candidate value function v over the to-be-determined interval of

beliefs [p1 , p2 ] over which experimentation levels are interior must be solved “bottom-up,” starting

from p1 . As argued, this lower threshold coincides with p1 , the level at which a lone player ceases experimentation. In addition,

v ′ (p1 ) = 0 (smooth-pasting) must hold.4 This suffices to solve (1) for the candidate value v, and

p2 is determined by u−i (p2 ) = I − 1. Hence, solving rv(p2 ) = −(I − 1)π(p2 ) gives p2 , above which

indifference no longer holds and risky is played exclusively.

See Figure 3 and Figure 4 for an illustration of the value functions. The intermediate (dashed)

curve is the symmetric equilibrium value function (as a function of the belief), and the lower (dotted)

and higher (solid) curves are the single-player and cooperative value (v̄) functions. Considering the

left figure (bad news), the encouragement effect drives the larger range of beliefs at which players

experiment (as can be shown), relative to the single-player case, yet the free-riding effect leads to an

amount of experimentation that remains too low nonetheless. In the right figure (good news), the

free-riding has a more dramatic impact, as experimentation stops at the same threshold as in the single-player case.

To what extent is the solution concept (symmetric MPE) driving these results? As KRC

show, symmetry is partly responsible for the inefficiency, and equilibria with higher payoffs can

be achieved when players take turns experimenting in a Markov equilibrium, for instance. More

generally, Hörner, Klein and Rady (2015) show that the Markov restriction plays an important

role in the inefficiency result, as well as the equilibrium structure. Strongly symmetric equilibrium

(SSE), in which all players choose the same continuation strategy after a given history (independent

of the identity of a potential deviator, for instance), is a restrictive solution concept in general, yet it

is still weaker than symmetric MPE, which are automatically SSEs. In the experimentation context,

SSE turn out to allow for exactly efficient equilibria for a range of (but not all) parameters. This is

the case, in particular, whenever news is bad (whether it is conclusive or not), or in the Brownian

case. Furthermore, as far as the total surplus (or the players’ average payoff) is concerned, it can

be shown that, in all cases, no perfect Bayesian equilibrium outcome outperforms SSE.

Several extensions are worth mentioning, although space does not allow us to discuss them

further. Klein and Rady (2011) solve the case of negatively correlated arms (a disease might be

either bacterial or viral, but not both). Players might be able to communicate publicly, in the

absence of direct monitoring of each other’s experimentation, a case considered by Heidhues, Rady

4 This derivative cannot be negative, since, for all ε > 0, v(p1 + ε) ≥ 0 = v(p1 ). It cannot be positive either. Otherwise, there is a solution to (1) with v(p1 − ε) = 0 as a boundary condition, giving v(p1 ) > 0, a contradiction.


and Strack (2015). Players might be asymmetric (Das, 2015), etc. Nonetheless, open questions

remain even for the benchmark models. While Cohen and Solan (2013) solve the problem of

optimal experimentation for a single player for a reasonably general class of Lévy processes (not

including the exponential bad news case, however), a similar analysis that would encompass both

the Brownian and the exponential model remains to be done. Also, whereas in the exponential case,

KRC (and KR15) were able to solve for the asymmetric MPE, it is unknown whether asymmetric

MPE exist in the Brownian case.5

What if players do not observe the outcomes of the other players, but only what they do? And

what if players don’t observe the actions of the other players, but only the outcomes? It is easy to

think of examples of either scenario: firms might observe each other’s profits, but not techniques; co-authors see each other’s output, not input. On the other hand, consumers see each other’s choices, not derived utilities; and firms might see each other’s supply chain without being able to

evaluate their satisfaction with specific suppliers.

The first problem (observed actions, unobserved outcomes) remains largely unsolved. Rosenberg, Solan and Vieille (2007) provide a partial answer to it, under the assumption that switching

to the safe arm is an irreversible choice, while also making clear that this assumption is restrictive:

players would like to switch back to the risky arm under circumstances that occur with positive

probability. The second problem (unobservable actions, observed outcomes) is better understood,

at least in the case of exponential bandits. The answer further underscores the differences between

good and bad news bandits.

Rosenberg, Solan and Vieille (2007) prove that, when the switch to the safe arm is irreversible,

the optimal policy remains an index policy (a generalization of a threshold policy defined shortly)

under weak distributional assumptions.

Consider the case of two players (the focus of their paper). By round n, assuming player 1 has

not switched to the safe arm yet, two possibilities arise. Either player 2 has not switched either, or he has, in some round k < n. Not observing player 2 stop is a good sign about 2’s observed payoffs. As a result,

player 1 uses a simple cut-off rule: he compares his private belief, based on his own observations

only, to a threshold that is non-increasing with n. If and only if his belief exceeds this threshold,

he continues experimenting. If, instead, player 2 switched to the safe arm in round k, player 1 faces

the standard decision problem of a one-armed bandit, for which the solution is an index policy

(with a cut-off in terms of his belief). Of course, his belief accounts for the decision of player 2 to

quit at time k. Whether an index policy remains optimal when the irreversibility assumption is

dropped is an important open question.

Bonatti and Hörner (2011) is based on the good news model, but involves a payoff externality.

As soon as a player experiences a breakthrough, the game ends, and every player derives the same

utility from it. However, experimenting (effort) is costly and its cost is privately borne. Think of

collaborators working on a common project from different locations. Effort is not observable, but

the output is automatically reported on Dropbox: if someone cracks the problem, it is immediately

common knowledge among players. Importantly, effort is private information. Free-riding is present,

but deviations do not lead to changes in the opponents’ beliefs, as such deviations are not observed.

As a result, the unique symmetric equilibrium, which as in KRC involves interior effort, leads

to higher effort provision than with observable effort. This is because unexpected, but observed,

shirking leads to a belief that is higher than it would be otherwise, and this optimism leads to

5 As mentioned above, there exist efficient SSE in the Brownian case.


more effort. Hence, shirking is more attractive when it is observed, as it leads other players to work

harder in the future (relative to how much they would work if shirking were not observed).

What is the ultimate fate of such collaborations? As Bonatti and Hörner show, effort is scaled

down over time, so that the asymptotic belief makes agents just indifferent between working and not.

The project dwindles over time, but is never abandoned.6 Gordon, Marlats and Ménager (2015)

consider a variation in which there is a fixed delay ∆ > 0 between a player’s breakthrough and

his collaborator observing it. Remarkably, the unique symmetric MPE exhibits periodic behavior,

with players alternating over time between phases in which they use the risky and the safe arm exclusively.

(These phases are further divided up according to whether a player is likely to hear news from his

collaborator or not, depending on whether his collaborator was using the risky arm ∆ instants ago.)

With bad news, the logic is reversed, as Bonatti and Hörner (2015) show. Observability leads

to more effort because unexpected shirking depresses collaborators, and so leads to lower effort.

Hence, such a deviation leads to an automatic punishment, and so mitigates free-riding relative

to unobserved effort. Surprisingly, the unique equilibrium is in mixed strategies. That is, players

randomize over extremal strategies, with a switching time to the risky arm (if no breakdown has

occurred until then) that is chosen at random (this is not equivalent to a pure strategy equilibrium

with interior levels of experimentation).

The fundamental difference is related to the kind of deviations that are most attractive. Under

good news, a player that briefly deviates to the safe arm unbeknown to the other players becomes

more optimistic than them, as the lack of a breakthrough is not as statistically significant for him.

Hence, his relative optimism strengthens his incentives to revert to experimentation. The deviation

“self-corrects” and local first-order conditions suffice, leading to interior but pure strategies. Under

bad news, a player that deviates the same way becomes more pessimistic than the other players,

as he views the absence of a breakdown as a more likely event. Increased pessimism pushes him

further towards the safe arm. Hence, local first-order conditions are no longer sufficient, and so

randomization must occur over pure extremal strategies.
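The direction of these belief wedges follows directly from Bayes’ rule: a deviator’s posterior depends on his own cumulative effort U through the no-news likelihoods. A small check of our own, with illustrative numbers:

```python
import math

def posterior_no_news(p0, lam_G, lam_B, U):
    """Posterior that the state is good, given cumulative experimentation U
    with no breakthrough (arriving at rate lam_G iff good) and no breakdown
    (arriving at rate lam_B iff bad)."""
    num = p0 * math.exp(-lam_G * U)
    return num / (num + (1 - p0) * math.exp(-lam_B * U))

p0 = 0.5
# Good news (lam_G > lam_B = 0): a secret shirker (smaller U) ends up MORE
# optimistic than the public belief, pushing him back to the risky arm.
assert posterior_no_news(p0, 1.0, 0.0, 0.5) > posterior_no_news(p0, 1.0, 0.0, 1.0)
# Bad news (lam_B > lam_G = 0): the same deviation makes him MORE
# pessimistic, pushing him further toward the safe arm.
assert posterior_no_news(p0, 0.0, 1.0, 0.5) < posterior_no_news(p0, 0.0, 1.0, 1.0)
```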

2.3 Extensions

Irreversibility. All but one of the papers mentioned in Section 2.1 involve reversible actions.

A player can resume experimentation after playing safe, and vice-versa. This is not innocuous.

Instead, Frick and Ishii (2015) assume that using the risky arm (the adoption of a new technology)

is an irreversible choice.7 Their model is not strategic, to the extent that there is a continuum of

agents. However, these agents are forward-looking, and learn about the state of the world (whether

adoption is profitable or not) from the adoption choices of others. The more others adopt, the

more likely it is that an exponentially distributed signal publicly reveals whether the state is good

or bad. Opportunities to adopt follow independent Poisson processes, one for each of the agents.

Frick and Ishii show that the adoption patterns differ markedly across scenarios. With good news,

there cannot be a region of beliefs for which agents are indifferent between adopting or not. This

is because, if an agent is willing to invest immediately after t in the absence of news (as would be

the case in the interior of such a region, if it existed), then he must strictly prefer to invest at time

t. Indeed, the only other event that can occur is good news, in which case adoption is optimal as

well. Hence, there is no option value to waiting. As a result, the equilibrium is extremal: all agents

6 This inefficiency calls for regulation. Campbell, Ederer and Spinnewijn (2011) consider the problem of the

optimal deadline as a way of alleviating the free-riding. The intervention could be purely informational: Bimpikis,

Ehsani and Mostagir (2016) suggest introducing a “black-out” phase in which no information is shared, a policy that

is shown to be optimal within some class of policies. The optimal informational mechanism remains unknown.

7 Irreversibility presupposes that the unit resource cannot be split across arms.


that get the chance to adopt do so up to some date t1 . If no breakthrough was observed by that

time, agents stop adopting. As a result, the adoption pattern (the fraction of adopters) turns out

to be a concave function of time. In contrast, with bad news, there is a region of beliefs (and so,

an interval of time) over which agents are indifferent, and so only a fraction of agents that get an

opportunity to adopt do so, resulting in an S-shaped adoption pattern.

It is worth mentioning that irreversibility is not necessarily socially costly. But this depends on

which action is irreversible. In the models of Section 2.1, the first-best is an equilibrium outcome

when switching to the safe arm is irreversible. In that case, it is no longer possible to be a bystander while others experiment. And because one still benefits from

the accelerated rate of learning that results from others experimenting, it is optimal to follow the

first-best strategy when others do so. More generally, we speculate that making the risky action

reversible and the safe action irreversible is socially valuable, as it encourages experimentation,

which is always too low.8

Murto and Välimäki (2011, 2013; hereafter MV11, MV13) consider stopping games (that is,

games where switching to the risky arm is irreversible) with observational learning and pure informational externalities.9 In both papers, stopping corresponds to taking the action whose payoff

depends on the state (in this sense, it is the risky one). The models have notable differences. In

MV11, the state is binary, values are correlated but not identical, and it is a good news model: if

the state is good, agents that have not stopped yet receive a private, independent signal revealing

the state at a rate that is exponentially distributed. In MV13, the state space is much richer: the state is the (common) optimal time to invest. Signals are received at the beginning once and for all,

but the signal space is rich (the monotone likelihood ratio property is assumed to impose some

discipline). In both models, agents observe the earlier times at which other agents chose to stop,

but nothing else. A robust finding across both papers is that, even as the number of players grows

large, efficiency is not achieved. Stopping occurs too late, relative to first-best. This is particularly

striking in MV13, as it highlights the role of time passing by, and how it differs from static models,

as in common value auctions in which asymptotic efficiency obtains. Further, MV11 show how,

with many players and continuous time, exit behavior displays waves of exit interwoven with calm

periods of waiting, during which the public belief hardly budges. Exit waves occur because the exit

of an agent implies a discrete jump in the public belief, leading with positive probability to the

immediate exit of further agents whose belief was nearly the same, which in turn triggers a further

cascade of agents leaving.

Payoff externalities. Most of the papers considered so far focus on pure informational exter-

nalities. From a theoretical point of view, this allows us to isolate informational incentives. From

an economic point of view, it is very restrictive. Other externalities arise in many contexts. For

instance, competition in the product market leads to payoff externalities (with more firms “experimenting” in the risky market, profits have to be shared among more firms). Decision-making

procedures also lead to externalities, when agents cannot select their actions of their own free will. Congresspeople do not decide whether they want to experiment in isolation, but rather make

decisions by majority voting.

Some papers explore such issues. Cripps and Thomas (2015) consider externalities in terms of

congestion costs.10 Imagine stepping out of an unfamiliar airport, hoping to catch a cab at the

curbside. Naturally, there are signs above your head providing potentially critical information for

8 We are not aware of a single model in which strategic interactions lead to over-experimentation. It would be

interesting to examine under which conditions this might happen (heterogeneity?).

9 See also Rosenberg, Salomon and Vieille (2013) for a related analysis in continuous time for the two-player case.

10 See also Thomas (2015).


those deciphering the local alphabet. But you are not one of them. As it turns out, nobody is

ahead of you, so this is your lucky day: either the first cab will be for you, or this is not a spot

where cabs are authorized to pick up fares. What do you do?

Suppose instead that there are three people in front of you, so that you are fourth in line

(your upbringing compelling you to incur the requisite congestion cost if you choose to stay). It

is clear to you that their knowledge of the local idiom is as deficient as your own. Yet, having

come before you, they might have observed useful data, like a cab picking up somebody ahead of

them. Communication would certainly help, but interpersonal skills are not your forte (it may be

embarrassing to admit you're standing in line with no assurance that this is even a cab stand; so any
cheap talk by people in line is likely to be uninformative anyway). What do you do?

This problem looks daunting: incomplete information (a binary state), coarse information sets

(the number of people ahead of you in the line), and both informational and payoff externalities.

Yet, at least for some range of parameters, Cripps and Thomas solve for the optimal policy. One

of the key insights is that information is nested: the first person in line knows as much as everyone

else, since he must have come before. Hence, provided that the queue is not so long that you balk

immediately, it is optimal to mimic this first person’s behavior. If he ever balks, then so should

you, as it reveals that he hasn’t seen a cab, and you cannot hope for a higher payoff than his. Yet,

it isn’t always optimal to stay as long as he does. Cripps and Thomas (2015) do not provide a

general solution, unfortunately, but their analysis provides insights into what can go wrong and

suggests alternative candidate strategies.

In Strulovici (2010), the externality comes about from the decision-making process. Each player

faces an independent one-armed bandit. However, actions are not independent. All players must

take the same action at every instant, which gets decided by majority voting. It is a good news

model, and outcomes are observed.11 As a result, players are divided into two groups: those who

have found out that their arm is good, and stubbornly cast their vote for the risky arm (which is

no longer risky for them), and those who have not yet received a good signal and grow anxious that

the safe arm might be the better alternative. If the first group takes over, the uncertain voters will

be doomed to use the risky arm, which is most likely bad for them. If instead the second group

pre-empts this from happening by voting for the safe alternative while they still can, players from

the first group will be frustrated winners. How does this trade-off play out? As it turns out, it leads

to under-experimentation. This is not too surprising, as the existence of the first group imposes a

cost which the social planner does not account for, on uncertain voters that might be stuck with

a bad arm. Strikingly, as the number of voters grows large, experimentation is abandoned as soon

as the myopic cut-off is reached.

This analysis suggests interesting questions for future research. What if the bandits are corre-

lated, for instance, when outcomes or information is unobserved? In a jury context, the accused is

either guilty or innocent. In such environments, understanding under which conditions information

aggregation obtains seems an important problem.

3 Motivation

3.1 Many Options and Coarse Learning

Experimentation is often carried out by a third party whose interests are not necessarily aligned

with those of the experimenter. Nobody cares about being a guinea pig, yet it is in society’s best

interest that some of us are. Consider the example of Waze, as discussed by Kremer, Mansour,

11 See also Khromenkova (2015) for the analysis of the mixed news case.


and Perry (2014, KMP). Every day, motorists must arbitrage between two routes. Travel time is

uncertain, and so is each motorist’s payoff πk , which is shared by the motorists traveling this route

that day. Let us assume that it is uniformly drawn from [ak , bk ], with k = M, I standing for the

type of the route: aI < aM < bM ≤ bI , so that the M erritt parkway is “safer” than I-95, but also

aM + bM > aI + bI , so it is on average faster. Motorists cannot postpone or forward their daily

commute, and arrive in a sequence. They do not care about experimenting, as travel times are

independently distributed across days, and they need to travel only once a day. However, upon

taking a route, their cell phone diligently relays travel time to Waze (and hence perfectly resolves

uncertainty for that day) unbeknown to the motorist.12

Consider first the case in which aM = 1, bM = 3, whereas aI = −5, bI = 5 (with the usual

stretch of imagination that negative payoffs demand). Our first motorist, an early bird aware of his

position in the line, selects the M -route, as its expected payoff of 2 exceeds zero, I-95’s expected

payoff. The second motorist’s choice is equally simple, independent of his knowledge of the first

motorist’s realized travel time. At worst, it is 1, which still beats the mean payoff from I-95. Hence

there is nothing he can be told that would change his mind: route M it is. More generally, if

E[πI ] < aM , all drivers take the Merritt parkway, and Waze loses its purpose (not to mention, to

the dismay of Waze’s innovators, Google’s attention). This is captured by Figure 5.

Instead, suppose that aM = −1 < bM = 5. The early bird selects route M , as before, because

its expected payoff exceeds I’s expected payoff. Let us first assume that later motorists observe

earlier ones’ choices and payoffs (if not, they all herd on the M -route). This is what is called

transparency here. If the first motorist’s realized payoff falls below 0 (say −0.5), then the second

motorist selects the I-route, resolving all residual uncertainty as far as the East Coast is concerned.

However, if the early bird’s realized payoff is 0.5, the second motorist (as well as all later ones)
mimics the first, despite the potential benefits from the I-route. This is where Waze comes into

play. Suppose that Waze commits to the following policy: it discloses to the second motorist

whether the first motorist’s travel time is below or above 1, and nothing more. If the realized

payoff is 0.5, the second motorist is willing to experiment with the I-route nonetheless, as, on

average, it still does as well as the M -route (0).
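This indifference is easy to verify numerically: conditional on falling below 1, a draw from U[−1, 5] is uniform on [−1, 1], with mean 0, matching the I-route's mean. A minimal Monte Carlo sketch (our own check; the sample size and tolerance are arbitrary choices, not from KMP):

```python
import random

# Payoffs in the second example: pi_M ~ U[-1, 5], and E[pi_I] = 0 for pi_I ~ U[-5, 5].
# Waze discloses only whether the first motorist's payoff fell below 1.
random.seed(1)
draws = [random.uniform(-1, 5) for _ in range(10**6)]
bad_news = [x for x in draws if x < 1]

# After bad news, the M-route's conditional mean is 0: the second motorist is
# exactly indifferent between the two routes, hence willing to try I.
mean_bad = sum(bad_news) / len(bad_news)
assert abs(mean_bad) < 0.01
```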

What if the first motorist’s payoff is 1.5? The second motorist must be told that the early bird

did well (more than 1 as a payoff), and so he will follow suit. The third motorist, however, need

not be told what the second one was told. In particular, Waze can commit to tell him to take the

M -route in one of two events. Either the second motorist took the M -route, as did the first, and

their common realized travel time fell short of 2.33; or the early bird’s travel time fell short of 1,

12 Like all stories, this one has practical limitations. In particular, choices do not affect travel times. This ignores congestion. But as every user of I-95 or US-101 will attest, there are some things logic cannot explain.


Figure 7: Choices of the second and third agent under the optimal recommendation policy.

so that the second experimented with the I-route, and the M -route turns out to be the faster one.

Comparing these two possibilities, the choice of 2.33 as a threshold ensures that the third motorist

is willing to experiment with the I-route if told so. And if he is not told to experiment (and neither

were any of the first three), the fourth is willing to do so, so that, with four motorists or more, all

uncertainty can be resolved, while making sure that each motorist remains sufficiently uncertain

about the right choice that it is optimal for him to follow Waze’s recommendation. See Figure 7

for an illustration.

KMP develop this logic further. With two actions and perfect learning, as in this example, the

optimal recommendation mechanism has a simple partitional structure.13 The mechanism divides the
interval [aM , bM ] of realized payoffs of the safer action into a collection of disjoint sets (not
necessarily intervals, and possibly empty) I1 , I2 , . . . , II with the property that, if the

realized payoff of the first motorist is in Ii , the i-th agent is the first to experiment with the action

with the lower mean (and all later agents make the right choice). Partition strategies need not be

optimal under stochastic learning, but KMP show that, as the number of agents grows large, they do
very well nonetheless (they are asymptotically optimal).

Sometimes once is not enough. Learning takes time, the early bird’s reassurances notwithstanding.

Let us follow Che and Hörner (2016) and assume that payoffs are coarse –either high (1) or low

(0)– but learning is not. Specifically, if the state of the world is good, selecting the risky option is

13 The main focus of KMP is on the case in which the designer maximizes the expectation of the average (undiscounted) payoff of the agents, (1/I) Σ_{i=1}^I π^i, where I is the number of agents and π^i is agent i’s realized utility.

best for everyone, but few of those that have tried bother to let others know. Specifically, assume

that, if and only if the state is good, (positive) feedback arrives at a random time, exponentially

distributed with an intensity proportional to the (infinitesimal) mass of agents that consume at

that point in time. That is, the rate at which feedback arrives is λωµt ,
where λ > 0 is some velocity parameter, µt is the fraction that consumes at time t, and ω = 0, 1 is

the state. Each consumer consumes at most once. A fraction ρ > 0 of consumers have a positive

opportunity cost c ∈ (p0 , 1) (regular agents), where p0 is the probability that ω = 1. The remaining

agents (fans) have sufficiently unattractive outside options that they experiment no matter what,

although they are just as critical in providing feedback. Because c > p0 , these are the only ones

experimenting at the start. They are the seeds that jump-start the process. However, if Waze (or

its equivalent) simply relays the feedback that these agents provide, if any, the remaining ρ fraction

of agents remain on the sideline, unless positive feedback is reported. As a result, experimentation

is too slow. To accelerate it, the designer may spam these agents, by recommending a fraction αt

of these agents that they try out the product. (Of course, this requires that these agents rely on

the designer as their exclusive source of information.) The designer commits to this policy, and

selects at random this fraction of guinea pigs. If one of the early consumers has reported that

the payoff is high, the designer and consumers have aligned interests. If not, the designer’s belief

evolves according to

ṗt = −λ(1 − ρ + ραt )pt (1 − pt ). (3)

The designer can capitalize on the possibility that he has learnt the state to make spamming credible,

provided that spamming is meted out proportionally to the likelihood of this event. Conditional

on being told to consume, an agent’s expected utility from consuming is

{[(1 − p0 )/(1 − pt )] αt pt + [(p0 − pt )/(1 − pt )] · 1} / {[(1 − p0 )/(1 − pt )] αt + [(p0 − pt )/(1 − pt )]} = [(1 − p0 )pt αt + (p0 − pt )] / [(1 − p0 )αt + (p0 − pt )],

given that (p0 − pt )/(1 − pt ) is the probability assigned by this consumer to the designer having

learnt the state by time t, and (1 − p0 )/(1 − pt ) is the complementary probability that he has not.

Setting this equal to c, and solving for αt , we obtain

αt = (1 − c)(p0 − pt ) / [(1 − p0 )(c − pt )]. (4)

Spamming is not credible at the very beginning (α0 = 0) because it is virtually impossible that

the designer has learnt something useful by then. But as time passes by, spamming becomes more

credible, and the rate of spamming accelerates. Equations (3) and (4) provide a pair of equations

that can be solved for the maximum spamming α, and the corresponding belief p.
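To see these dynamics concretely, one can integrate (3) forward under the policy (4). The sketch below (the parameter values λ, ρ, c, p0 are our illustrative assumptions, not taken from Che and Hörner) also checks that the recommendation leaves a regular agent exactly indifferent, i.e., that the conditional expected utility above equals c whenever αt > 0:

```python
lam, rho, c, p0 = 1.0, 0.5, 0.4, 0.3   # assumed parameters, with p0 < c < 1

def alpha(p):
    """Credible spamming rate from equation (4), clipped to [0, 1]."""
    return max(0.0, min(1.0, (1 - c) * (p0 - p) / ((1 - p0) * (c - p))))

def utility_told_to_consume(p, a):
    """Agent's conditional expected payoff, as derived in the text."""
    return ((1 - p0) * p * a + (p0 - p)) / ((1 - p0) * a + (p0 - p))

# Euler integration of the belief ODE (3), conditional on no feedback arriving.
dt, p = 1e-4, p0
path = []
while p > 0.25:
    a = alpha(p)
    path.append((p, a))
    p -= lam * (1 - rho + rho * a) * p * (1 - p) * dt

# Spamming starts at zero and accelerates as the belief drifts down...
assert path[0][1] == 0.0 and path[-1][1] > path[1][1]
# ...and the incentive constraint binds throughout: the agent is indifferent.
for p, a in path[1:]:
    if a > 0:
        assert abs(utility_told_to_consume(p, a) - c) < 1e-9
```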

However, if the designer is benevolent, spamming has to cease eventually. Persistent lack of

feedback from consumers reinforces his belief that the payoff is low, so that experimentation has

to stop. When? With only two types of agents, the threshold at which it stops coincides with

the cut-off p that prevails absent the incentive constraint limiting the amount of spamming.

This is because, at the point at which experimentation with regular agents stops, the continuation

game is identical to the one without incentive constraints: only fans experiment. As a result, the

first order condition that dictates when spamming ceases is the same in both problems. Incentives


Figure 8: Experimentation rate and belief evolution.

constrain the rate of experimentation from regular agents, but when considering when to stop, this

rate is irrelevant for the marginal value of continuing for another instant.14

Overall, the pattern of experimentation over time is hump-shaped. Experimentation is most

valuable at the start, but this is when the agents’ suspicions prevent it from being in full swing.

It then rises gradually as the designer’s recommendations gain credibility, and finally tapers off as

growing pessimism means that high-cost agents should not be carrying out such experimentation.

See Figure 8.

An important assumption made so far in our discussion of these two papers is that recommen-

dations are confidential. If spamming is public (reviews, endorsements, etc.), the designer is further

constrained. In the last example, it is then optimal for the designer to behave as follows. If he gets

feedback at any time, he tells everyone to experiment from that point on. If he fails to, he picks

at random a time at which he claims that he did, and gets everyone to experiment from that point

on, but only until some later time at which, if he still hasn’t heard positive feedback, he concedes

that his claim was unwarranted, and experimentation (from regular agents) stops. This spamming

occurs at most once. The distribution over the random time is chosen to make experimentation

start as early as possible, subject to the claim that he received feedback being sufficiently credible

that agents are willing to follow his recommendation over the next phase.

The models we have discussed so far assume that the principal’s objective is to maximize

social welfare. This may be the right objective for many recommender systems (like Waze, news

websites recommending stories to read, or review websites recommending places to visit or apps to

download) since the equilibrium with full information disclosure is likely to have inefficiently low

experimentation because of the positive externality learning by early consumers has on later ones.

That said, if the recommender’s objective is to maximize the number of users (for example, because

it can monetize traffic to its website), then the objective function can become more complicated,

especially in the presence of competition between recommender platforms. That problem has been

studied by He (2016). She considers differentiated platforms and agents making single-homing

14 With more than two types, this reasoning carries over to the lowest type among those agents with a positive opportunity cost. But it does not apply to those with a higher cost. Because the rate of experimentation is too slow, the designer optimally chooses to spam such agents a little longer than he would absent incentive constraints. His reduced experimentation “capacity” in the continuation makes their willingness to partially experiment valuable, and he spams at beliefs below those at which they should in the unconstrained problem.

decisions which one to join based on the experimentation and recommendation policies the platforms

follow.

When agents choose platforms without knowing whether they will be early or late in the queue

for recommendations, the optimal policy is to maximize efficiency of dynamic learning, consistent

with what we have described so far. However, if agents have some information regarding whether

they are likely to be early or late, then a new economic force appears. A platform can attract more
early users by recommending the myopically optimal product/route instead of the dynamically-efficient
one (which takes into account the option value of learning and making better recommendations to
future agents). The platform with a larger early market

share can obtain a competitive advantage in the future because learning from more early consumers

can lead to superior recommendations to future customers. He (2016) shows that a merger of the

platforms can make the recommendation system more efficient, because a monopolist can internalize

the business stealing effect.

4 Delegation

Information is only one of the many tools available to the principal. Authority is another. In

two very different environments, Guo (2016) and Grenadier, Malenko and Malenko (2015, GMM)

examine the scope for and the structure of delegation in dynamic problems. In both papers, the

direction of the conflict of interest (expressed in terms of timing of the optimal decision) changes

the answer drastically. To understand why, we start with a simple example.

Consider a simple cheap-talk model based on Crawford and Sobel (1982, CS).15 We assume that the

state is uniformly distributed on [0, 1], and that preferences are quadratic. The receiver is unbiased,

with preferences −(t − y)2 , where t is the state of the world (the sender’s type) and y ∈ R+ is the

action by the receiver. The sender has preferences −(t + b − y)2 , so that his ideal action is t + b, where b ∈ R. We distinguish

(i) positive bias, b > 0, and (ii) negative bias, b < 0. Throughout we assume |b| < 1/2, for the

problem to be interesting (babbling is the commitment solution otherwise). In CS, this distinction

is irrelevant, as it is a matter of normalization. Here instead, we assume that t also indexes time,

and that, as time passes by, opportunities vanish. That is, if the receiver acts at time t, his available

actions are elements of Yt = {y ∈ R : y ≥ t}. Delaying is always an option, but acting can only take

place once (it ends the game). We ignore any discounting or cost of delay, the possibility of never

taking an action (add a large penalty for doing so) and leave aside the technical modeling details

regarding continuous-time strategies. Let us think of the sender being able to send a message in

[0, 1] at any time t and impose standard measurability restrictions.

Because time flows from “left to right,” we will argue that the positive and negative bias cases

are very different, that is, admit very different sets of equilibria.

Positive bias (bias for delay). In that case, we claim that the receiver can do as well as

under commitment in the standard model of CS in which types have no temporal meaning. The

commitment case has been solved by Holmström (1984) and in greater generality by Melumad and

Shibano (1991). It involves setting:

y(t) = t + b if t ≤ 1 − 2b, and y(t) = 1 − b if t > 1 − 2b.

15 A related example appears in a preliminary draft by Gordon and de Villemeur (2010).

[Figure 9: the schedule y(t), equal to t + b for t ≤ 1 − 2b and constant at 1 − b thereafter.]

See Figure 9. We focus on the equilibrium that is best for the receiver.16
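As a quick numerical sanity check on the cap at 1 − b (our own computation, following the standard delegation logic; the value of b is an illustrative assumption): if the receiver permits all actions up to a cap K, a sender with ideal point t + b chooses y = min(t + b, K), and the receiver's expected loss is minimized at K = 1 − b.

```python
# Receiver's expected loss under a cap K, with t ~ U[0, 1] and sender ideal t + b:
# types t <= K - b get their ideal action (loss b^2 each); higher types pool at K.
b = 0.15  # illustrative bias, b < 1/2

def receiver_loss(K):
    return (K - b) * b**2 + (b**3 + (1 - K)**3) / 3

grid = [b + i * (1 - b) / 100000 for i in range(100001)]   # caps in [b, 1]
best_cap = min(grid, key=receiver_loss)
assert abs(best_cap - (1 - b)) < 1e-3
```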

Why does this require no commitment? Time endows the receiver with the necessary power.

When the sender tells him to act at time t ≤ 1 − b, the receiver learns that the state is t − b, but

this action is no longer available. So the best he can then do is to take action t. Similarly, once

time 1 − b arrives, he knows that the state is in [1 − 2b, 1], and his expected action 1 − b matches

what he would do in the next split second had perfect revelation continued –given that it won’t, it

remains optimal in the absence of information.

Note that (unlike in the commitment case) there are lots of other equilibria. For instance,

babbling is always an equilibrium, and so are many other messaging strategies.

Negative bias (bias for action). With a negative bias, it is clear that the commitment optimum

(which now involves a “floor” concerning types in [0, 2|b|], pooled at |b|, and perfect separation over

high types, who obtain t − |b|) is no longer incentive compatible. If a type t ≥ 2|b| reveals his type,

the receiver can take action t if he can take action t − |b|, so that sequential rationality prevents

him from taking the sender’s optimal action.

Does this mean that we are back in the world of CS? In addition to the equilibria of CS,

additional ones arise because of the role of time. For instance, consider a two-partition equilibrium,

in which types in [0, t1 ] get mapped into action y1 > t1 /2 and types in (t1 , 1] get mapped into

y2 = (1 + t1 )/2. This can be achieved by waiting till time y1 (messages are interpreted as “babble”

until then), and then senders separating at time y1 between those whose type is above t1 and those

whose type is below.17 Type t1 must be indifferent between y1 and y2 , which requires that his ideal action lie midway between them (writing b for |b| in the formulas that follow):

y1 + y2 = 2(t1 − b), i.e., y1 = 2(t1 − b) − (1 + t1 )/2.

16 In CS, ex ante payoffs are aligned and so this also gives the highest payoff to the sender. But this is only true for any allocation in which the receiver takes his preferred action given his information (which is not necessarily the case in our environment, in which there might be a lower bound on the action he can take).

17 If senders do so by randomizing over [0, t1 ] or (t1 , 1] depending on whether their type is above or below t1 at time y1 , all messages are on path, so that what would happen if the sender deviated at time t1 is irrelevant –the receiver will act at that time. And the receiver has nothing to gain from waiting any longer, if he expects no further meaningful information to be transmitted.

The receiver must prefer waiting until time y1 to acting at time 1/2 (his favorite in the absence of
any information), namely,

∫_0^1 (t − 1/2)² dt ≥ ∫_0^{t1} (t − y1 )² dt + ∫_{t1}^1 (t − y2 )² dt,

or

4t1² − (3 + 16b)t1 + 8b(1 + 2b) ≤ 0,

an expression that is negative (it equals 2b − 1/2) at t1 = (1 + 4b)/2 = 1/2 + 2b, the value of t1 given the partition

{[0, t1 ], (t1 , 1]} in the original CS-model. Hence, for b < 1/4 (the usual condition for non-babbling

equilibria to exist in CS), we have a range of values of t1 , namely,

t1 ∈ [1/2 + 2b, 1/2 + 2b + (√(9 − 32b) − 1)/8],

for which such equilibria exist (this interval is non-empty by our assumption that b < 1/4).18 But

note that they are all worse than the original two-partition equilibrium of CS, as the higher value
of y1 pushes t1 above what it would have been if y1 = t1 /2. The problem is that the lower interval is

already too large, compared to the higher one.
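These claims are easy to verify numerically for a particular bias (here b stands for |b|, and the value b = 0.1 is our illustrative choice). The check below confirms both that the quadratic matches the receiver's comparison of integrals and that it is nonpositive exactly on the stated interval:

```python
b = 0.1  # illustrative bias magnitude, b < 1/4

def q(t1):
    """Receiver's waiting condition: q(t1) <= 0."""
    return 4 * t1**2 - (3 + 16 * b) * t1 + 8 * b * (1 + 2 * b)

def integral_form(t1):
    """12 * (two-interval loss minus no-information loss), computed directly."""
    y1 = 2 * (t1 - b) - (1 + t1) / 2          # type-t1 indifference
    d = t1 - y1
    return 4 * (d**3 + y1**3) + (1 - t1)**3 - 1

lower = 0.5 + 2 * b                            # = 0.7 here
upper = 0.5 + 2 * b + ((9 - 32 * b) ** 0.5 - 1) / 8

for t1 in (0.72, 0.8, 0.87):                   # points inside [lower, upper]
    assert q(t1) <= 0
    assert abs(integral_form(t1) - 3 * t1 * q(t1)) < 1e-12
assert q(upper + 0.01) > 0 and lower < upper   # fails just outside
```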

It turns out that in our uniform-quadratic set-up, this is a general feature: no pure-strategy

equilibrium involving one-way communication improves on the best CS equilibrium.19 This is

because, first, and as in CS, an equilibrium must be partitional. Because the sender’s preferences

satisfy the single-crossing property, if types t < t′ are both mapped into the same outcome y, then

so must all types in the interval [t, t′ ]. Letting [t, t′ ] be the maximum interval associated with y,

either y = (t + t′ )/2 because the receiver finds out about the sender being in that range before

this time arrives, or y > (t + t′ )/2 because he learns about it at time y. Surely, the type y + |b|

is in the interval [t, t′ ] and so each interval is of length at least 2|b|, and so there are finitely many

intervals, I1 , . . . , IK . The action associated with the highest interval, IK , must be its midpoint, as it

is best for the receiver to take if the time comes. Consider the problem of maximizing the receiver’s

expected utility over sequences {y1 , . . . , yK }, t0 = 0, . . . , tK = 1, such that each yk is at least as

high as the midpoint of Ik , and each tk − |b|, for k = 1, . . . , K − 1, is equidistant from yk and yk+1 . It is easy to see that

it is maximized at the midpoints.

Guo and GMM consider two dynamic principal-agent models in which there is uncertainty (an

unknown state of the world) and risk (imperfect learning via the observation process, though in

GMM the principal only learns via the agent’s reports). In both models, the agent has private

information that does not evolve over time. The principal and the agent have a conflict of interest.

In Guo, it is an experimentation model in which the agent has a vested interest in playing risky

or safe (depending on his bias), but also private information at the start regarding the viability

of the project. In GMM, the principal faces a problem of real options (an irreversible investment

decision) based on a publicly observable signal process (an exogenous geometric Brownian motion)

that relates to the project’s profitability via a factor that is the agent’s private information.20

18 This class of equilibria involves the standard CS equilibrium of partition size 2. The same argument applies to any CS equilibrium, independent of its partition size.

19 Of course, one can do better with either a randomization device or the receiver talking as well, along the lines of long cheap-talk (Aumann and Hart, 2003), which becomes especially natural in this dynamic environment.

20 The main difference isn’t Poisson vs. Brownian uncertainty. Rather, it lies in the updating. In Guo, the public signal is informative about the agent’s type, whereas in GMM, it is not, which simplifies somewhat the updating.

In both models, the principal designs an optimal mechanism, with or without commitment.

The principal has authority, and can dictate the agent’s choices based on his reports. It matters

whether the agent’s bias leads him to favor delaying the (optimally, irreversible) decision or not. If

he enjoys a longer phase of learning, the principal has to counteract the agent’s incentives to act

too late; if he enjoys a shorter phase of learning, inefficiently early stopping or investment has to be

prevented. This is the dichotomy already present in the simple example of the previous section.

In the case in which for any given prior belief the agent prefers to switch to the safe arm

later than the principal, Guo shows that the optimal policy (from the principal’s viewpoint) can be

implemented by a sliding rule: based on the observed history, in which experimentation was pursued

throughout by the agent, the principal updates his belief as if the agent were equally uninformed
regarding the project’s potential. The principal picks a cut-off (in belief space) at which he
orders the agent to stop. Until then, the agent is free to stop on his own, which might happen if

the agent’s private information is so bad that earlier termination might be preferable from his point

of view despite his bias. The simplicity of this rule is remarkable. Furthermore, as in our example,

and for the same reasons, the principal needs no commitment to implement it. If he were free to

grant authority to stop or continue experimentation at every moment, he would find it optimal to

leave it up to the agent to decide up to the moment at which this threshold is being crossed.21

In GMM, the bias is exactly as in CS: upon exercise of the option to invest, the agent receives

an additional b to the investment payoff. Positive bias means the agent is biased towards early

exercise. As in our example, this implies that delegation dominates communication (at least in

the simple class of equilibria with one-way communication, as described above). Indeed, GMM

show that such (stationary) equilibria involve a partition of the possible values of the signal process

into intervals. The agent recommends that the principal invest as soon as the observable process

enters a high enough interval, where “high enough” depends on his information. At that point, the

principal invests. In contrast, when bias is negative, so that late exercise is favored by the agent,

the outcome of delegation, which often involves full revelation, can also be achieved in the game

with cheap talk and no commitment.

5 Information Design

In this section, we abstract from exogenous learning. In the absence of a message by the informed

player, the other player does not learn anything. At the same time, we enrich the model by allowing

the state to change over time, and by endowing the informed player with a richer set of verifiable

messages.

A decision-maker (DM) must decide whether to invest (a = I) or not (a = N ) in a project with

uncertain returns. Her decision is a function of her belief about the state of the world ω = 0, 1, which
determines the return r(ω). Investing is optimal if and only if p = P[ω = 1] ≥ p∗ . Accordingly, the invest region is

I = {p : p ≥ p∗ }, see Figure 10. Unfortunately, the DM does not observe the state, and must rely on

an advisor’s goodwill to get information. The advisor receives a fixed payoff of 1 from investment.

Our players are trained statisticians, not economists: they have no idea about the potential benefits

of contractual payments, but know all about statistical experiments à la Blackwell. In particular,

the advisor can commit to a splitting of his prior belief into an arbitrary distribution over posterior

21 Guo also considers the case of bias toward early action, and also finds that the optimal contract with commitment is time-inconsistent and so can’t be implemented without commitment.

[Figure 10: the belief line, with the no-invest region N below p∗ and the invest region I = [p∗ , 1] above; the prior is p0 .]

beliefs (a mean-preserving spread), with the DM only observing the realized posterior.22 If the

game is one-shot, the best experiment from the advisor’s point of view is rather immediate. To

maximize the probability that the DM invests, it is best to make the posterior belief that leads

to an investment to be such that the DM is just indifferent, that is, equal to p∗ , and the other

posterior equal to 0.23 See Figure 10. The resulting payoff is p0 /p∗ (more generally, min{1, p0 /p∗ },

as p0 ≷ p∗ ). This is the celebrated concavification formula of Aumann and Maschler (1995), which

states that the maximum payoff over possible splittings is equal to the smallest concave function

no less than the payoff function given a belief (namely, min{1, p0 /p∗ } = cav1{p≥p∗ } (p0 )).
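The splitting is straightforward to verify in code (the values of p0 and p∗ are our illustrative assumptions):

```python
p0, p_star = 0.3, 0.6   # prior below the cutoff: p0 < p_star

# Greedy split: posterior p_star with weight p0 / p_star, posterior 0 otherwise.
lam = p0 / p_star
posteriors = [(p_star, lam), (0.0, 1 - lam)]

# Bayes-plausibility: the posteriors must average back to the prior.
assert abs(sum(p * w for p, w in posteriors) - p0) < 1e-12

# Advisor's payoff: the DM invests exactly when the posterior reaches p_star.
payoff = sum(w for p, w in posteriors if p >= p_star)
assert abs(payoff - min(1.0, p0 / p_star)) < 1e-12
```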

We call the corresponding strategy greedy. The greedy strategy is the strategy that minimizes

information revelation subject to maximizing the one-shot payoff. Formally, it is defined for environ-

ments in which there is a binary action (such as investing or not) and the payoff to be maximized

(the advisor’s) is independent of the true state, although it may depend on the action. Let us

define the region of beliefs I for which investing is optimal from the DM’s point of view (here,

[p∗ , 1]). Then the greedy strategy specifies that no information be disclosed if p ∈ I. Otherwise,

it specifies that the DM be told one of two posteriors pI and pN , with weights λ and 1 − λ (so

p = λpI + (1 − λ)pN ), with pI ∈ I, in a way that maximizes the weight λ ∈ [0, 1]. Such a greedy

strategy is generically unique and can be solved using linear programming.

Now suppose that the investment decision is repeated twice, yet the state is persistent: it

remains the same with probability ρ ≥ 1/2, and it switches with probability 1 − ρ. Hence, the

posterior belief φ(p) = ρp + (1 − ρ)(1 − p) lies somewhere between p and 1/2, depending on ρ. The advisor’s payoff is

1{a1 =I} + β1{a2 =I} , where an is the action in round n, and β ≥ 0 is the weight assigned to the

second round.

To apply Aumann and Maschler’s formula, we must compute the payoff π(p), where p is the

posterior belief of the DM after the splitting, but before the investment decision in the first round.

Because at that stage, the DM invests if and only if p ≥ p∗ , π(·) jumps up by 1 at p∗ , but this

fixed benefit is sunk and simply added to what accrues from the second period. If p ≥ φ−1 (p∗ ),

then no information should be disclosed in the second period, because φ(p) ≥ p∗ . In contrast,

disclosing additional information is optimal if p < φ−1 (p∗ ). As is clear from the left panel in Figure

11, the graph of the resulting payoff differs markedly according to whether p∗ ≷ 1/2 (equivalently, whether φ−1 (p∗ ) ≷ p∗ ). In case p∗ > 1/2, the optimal strategy is greedy for all β. The intuition is as follows.

The best case scenario for period 2 payoffs is when the agent reveals nothing in period 1. In that

case, since φ(p) < p∗ , the optimal disclosure in period 2 is to split the belief between p∗ and 0.

Now consider the agent following the greedy strategy in the first period, which also induces beliefs

0 and p∗ . After that, whatever the interim beliefs, in the beginning of the second stage, the beliefs

are less than p∗ and in both cases it is optimal to induce posteriors of 0 and p∗ . Since in that

²² Note that the advisor need not observe the realized state, but somehow must have access to a technology that generates any posterior distribution.
²³ Such static problems are considered in greater generality by Rayo and Segal (2010) and Kamenica and Gentzkow (2011).

Figure 11: The payoff π(p) and its concavification cav π, in the two cases p∗ ≷ 1/2.

range the payoff function is linear in beliefs, there is no cost for second-period payoffs if the agent is greedy in the first period. In contrast, if p∗ < 1/2, the greedy strategy is optimal only if β is sufficiently small. The reason is that, when p∗ < 1/2, a belief of p∗ at the end of period 1 becomes a belief above p∗ at the beginning of period 2, so disclosure in period one spreads period-2 beliefs over a concave part of the value function. In this case, there is a trade-off between maximizing period-1 and period-2 payoffs.

Figure 12 shows why the greedy strategy is no longer optimal once β is large enough. For

p < φ−1 (p∗ ), the optimal strategy consists in releasing just enough information for the posterior

belief to be equal to φ−1 (p∗ ). Hence, in this example, the optimality of the greedy strategy breaks

down if the future is important enough, and beliefs drift towards the interior of the Investment

region. This is easy to understand: if beliefs drift outwards, as on the right panel of Figure 11,

then both splittings according to the greedy strategy, and movements due to mean-reversion keep

all beliefs inside the region over which π is linear. Given the martingale property of beliefs, this

ensures that there is no loss in using the greedy strategy. Once beliefs drift towards the investment

region, splitting the belief all the way to p∗ leads to a cost for the future, as mean-reversion pushes

the posterior belief into a range over which π no longer has the same slope as in the range

where the prior belief lies.

How robust is the optimality of the greedy strategy? Remarkably, Renault, Solan and Vieille

(2015, RSV) show that, in the infinitely (discounted) repeated investment game, the greedy strategy

is optimal for an arbitrary two-state irreducible Markov chain. Despite the fact that the future may

outweigh the present, stationarity ensures that the scenario depicted in Figure 11 does not occur,

or rather that, whenever the intertemporal cost cannot be avoided, the greedy strategy achieves

the best trade-off.

Despite this remarkable result, the conditions under which the greedy strategy is optimal are

very special. Suppose, for instance, that some extraneous public signal, obtained at the beginning

of each of the two rounds, determines whether p∗ = 0.2 or 0.8, with both being equally likely, but

independent across rounds. To simplify, we assume that ρ = 1, so that the state is the same in both

rounds. Let p0 = 0.2. Suppose that p∗ = 0.8 in the first round. By not disclosing any information

in the first round, the advisor receives a payoff of at least β/2, because with probability 1/2 the prior belief suffices tomorrow. In contrast, by following the greedy strategy, the advisor splits the prior into

0 (with probability 3/4) and 0.8 (with probability 1/4), resulting in a payoff of (1 + β)/4 < β/2 if

β > 1. Hence, the greedy strategy is not optimal if the future is sufficiently important. If the DM


Figure 12: The payoff π(p) and its concavification cav π when β is large.

invests once and only once, the threshold patience further decreases to β ≥ 1/2.
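The arithmetic of this two-round example is easy to check. A quick sketch (hypothetical helper names; β is the weight on the second round) compares the greedy payoff (1 + β)/4 with the lower bound β/2 obtained by disclosing nothing in the first round:

```python
def greedy_payoff(beta):
    # Greedy in round 1 at p* = 0.8: split p0 = 0.2 into 0.8 (prob 1/4) and
    # 0 (prob 3/4). Only the posterior 0.8 invests in round 1, and it also
    # invests in round 2 whether p* turns out to be 0.2 or 0.8.
    return 0.25 * (1.0 + beta)

def waiting_payoff_bound(beta):
    # No disclosure in round 1 (round-1 payoff 0): with probability 1/2
    # tomorrow's threshold is p* = 0.2, so the prior 0.2 already suffices.
    return 0.5 * beta
```

At β = 2, greedy yields 0.75 while waiting guarantees at least 1, so greedy is suboptimal whenever β > 1 (and waiting does even better than this bound, since the advisor can still split the belief in round 2 when p∗ = 0.8).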

This example relies on p∗ being random, which implies that the relevant distance between the

prior belief and the boundary of the investment region might become more favorable by waiting,

rather than giving away information. The same phenomenon arises without randomness, by simply

considering more states.

Indeed, RSV construct the following three-state counter-example, illustrated in Figure 13. The

horizon is infinite, and the discount factor is δ < 1. The DM invests in the light-gray area, the

triangle with vertices ω ∗ , ω1 , and εω1 + (1 − ε)ω2 , where ε > 0 is a small number. Suppose that the

prior is p1 and the state’s persistence is 1/2, so that given the invariant distribution ω2 , the belief

absent any disclosure after one round is

(1/2) p1 + (1/2) ω2 = εω1 + (1/2) ω2 + (1/2 − ε) ω3 .

The greedy strategy calls for a splitting between ω3 and ω ∗ , with probability 4ε on ω ∗ . The

resulting payoff is 4εω ∗ < 4ε. By waiting one round and splitting the resulting belief between

p2 (with probability 1/(2(1 − ε))) and p3 , the resulting payoff is 1/(2(1 − ε)), so that, as long as

4ε < δ(1 − δ)/(2(1 − ε)), the latter strategy dominates.
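The belief arithmetic in this counter-example can be verified directly, representing beliefs as vectors over the three states (a sketch; ε = 0.1 is an arbitrary choice):

```python
import numpy as np

eps = 0.1
w1, w2, w3 = np.eye(3)   # point masses on the three states

p1 = 2 * eps * w1 + (1 - 2 * eps) * w3
p2 = eps * w1 + (1 - eps) * w2
p3 = eps * w1 + (1 - eps) * w3

# Belief after one round absent disclosure: (1/2) p1 + (1/2) w2.
q = 0.5 * p1 + 0.5 * w2
assert np.allclose(q, eps * w1 + 0.5 * w2 + (0.5 - eps) * w3)

# Waiting and splitting q between p2 and p3 uses weight 1/(2(1 - eps)) on p2.
lam = 1 / (2 * (1 - eps))
assert np.allclose(q, lam * p2 + (1 - lam) * p3)
```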

5.2 Beeps

Ely (2015) tells the following relatable story of “beeps” to motivate the optimal design of information

disclosure. A researcher clears her email inbox and begins to work productively at her desk. She

believes that emails arrive according to a known stochastic process, and she feels compelled to stop

working and check her email when her belief ν that a new email has arrived is strictly greater than

some threshold p∗ ∈ (0, 1). Assume the researcher gets utility from working but not from checking

email since she will become distracted on her computer (so in this stylized example, her compulsion

to check email is purely pathological). Her discount rate is r > 0.

The researcher can set up her computer to produce an audible beep when an email is received.

Perhaps surprisingly, this “distraction” may actually increase the expected time until the researcher

checks her email. The beep will compel the researcher to immediately check her email (since her

posterior belief µ will jump to 1), but the knowledge that the beep mechanism is enabled (and

functional) allows her to work indefinitely with the certainty that no email has arrived, as long

as she has not heard a beep. Many other beeping mechanisms are imaginable, so the question


Figure 13: The three-state counter-example. The simplex over {ω1 , ω2 , ω3 } is drawn with the investment region I, the point ω ∗ = (1/2)ω1 + (1/2)ω3 , and the beliefs p1 = 2εω1 + (1 − 2ε)ω3 , p2 = εω1 + (1 − ε)ω2 , p3 = εω1 + (1 − ε)ω3 , and εω1 + (1/2)ω2 + (1/2 − ε)ω3 .

naturally arises: What informational filtering policy maximizes the expected discounted time until

the agent checks her email?

The set of states is Ω = {0, 1} indicating whether or not an email has arrived, so state 1 is

absorbing. Emails arrive at Poisson rate λ, but we work in discrete time with periods of length

∆. Hence, the transition probability from state 0 to state 1 is m := 1 − e−λ∆ . Let µt (νt ) denote

the agent’s posterior belief, at time t before (after) receiving the principal’s message, that an email

has arrived. According to Bayesian updating, when the agent does not receive new information her

beliefs obey the following law of motion:

µt+∆ = νt + (1 − νt )m.

In the initial state, we assume no emails have arrived, so it is common knowledge that µ0 = ν0 = 0.

The agent checks her email iff νt > p∗ . The principal’s utility can be expressed directly as a function

of the agent’s posterior belief:

u(ν) = 1 if ν ≤ p∗ , and u(ν) = 0 if ν > p∗ .

The principal’s discount factor is δ = e−r∆ . The principal commits to an information disclosure

policy, so the agent always interprets the principal’s messages correctly.

The solution to this problem is an immediate corollary of the optimality of the greedy strategy,

as established by RSV.²⁴ Here, it is straightforward to compute.

Case 1: µ ≤ p∗ . Since the agent is taking the desired action (working), the principal releases no

information.

²⁴ Since this model of beeps satisfies the conditions of Theorem 5 in RSV, it follows that the greedy strategy is optimal for any initial distribution, in particular, for the choice of µ0 = ν0 = 0. Indeed, this example of beeps is formally equivalent to the investment model in RSV in which the agent’s state-dependent payoff from investing r : Ω → R is given by r(0) = p∗ /(1 − p∗ ) and r(1) = −1. For then the agent invests, and hence the principal receives payoff 1, if and only if 0 ≤ ν r(1) + (1 − ν) r(0) = −ν + (1 − ν)p∗ /(1 − p∗ ), i.e., ν ≤ p∗ .


Case 2: µ > p∗ . The principal chooses the greedy binary splitting at µ, i.e., solves max aI under the constraints µ = aI pI + aN pN , over posteriors pI , pN and weights aI , aN , with pI ≤ p∗ , aI , aN ≥ 0, aI + aN = 1. It is easy to see that the maximum is achieved by pI = p∗ and pN = 1, so that

aI = (1 − µ)/(1 − p∗ ), aN = (µ − p∗ )/(1 − p∗ ).

Therefore the principal should release one of two different messages, one that keeps the agent’s belief at p∗ and one that sends the agent’s belief to 1.
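These weights are immediate to verify numerically (a sketch with arbitrary values of µ and p∗; the helper name is mine):

```python
def beeps_split(mu, p_star):
    """Greedy binary splitting for mu > p_star: posteriors p_star and 1."""
    a_i = (1.0 - mu) / (1.0 - p_star)     # weight on posterior p_star (keep working)
    a_n = (mu - p_star) / (1.0 - p_star)  # weight on posterior 1 (check email)
    return a_i, a_n

a_i, a_n = beeps_split(mu=0.6, p_star=0.5)
assert abs(a_i + a_n - 1.0) < 1e-12              # weights form a probability
assert abs(a_i * 0.5 + a_n * 1.0 - 0.6) < 1e-12  # martingale: the mean is mu
```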

Just because the greedy strategy is not robust does not imply that there are no interesting

economic settings in which it is optimal. Ely provides a few such examples. More generally,

however, little is known on the structure of the optimal strategy. Plainly, the principal’s problem is

a standard Markov decision problem, where the state variable is the agent’s posterior belief, and the

control is any splitting over beliefs. Hence, in the discounted case, it suffices to solve the optimality

equation

V (µt ) = max Ep [(1 − δ)u(νt ) + δV (f (νt ))],

where the maximum is over splittings p ∈ ∆(∆(Ω)) with Ep [νt ] = µt ,

with the obvious definitions. Well-known numerical methods exist to solve such a functional equa-

tion (e.g., value iteration, the method used by Ely; but also policy iteration, linear programming,

etc.). Perhaps structural properties of the optimal policy can be found in some interesting classes

of models (supermodular preferences, first-order stochastic dominance on the state transitions).²⁵
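As a rough illustration of the value-iteration approach, one can discretize the belief space and replace the maximization over splittings by an upper-concave-envelope (concavification) step at each iteration. The sketch below does this for the beeps example; the grid size, rates, and threshold are arbitrary choices, not parameters from any of the papers:

```python
import numpy as np

def upper_concave_envelope(x, y):
    """Evaluate the upper concave envelope of the points (x[i], y[i]) back on
    the grid x, via a monotone-chain upper hull (x must be increasing)."""
    hull = []  # indices of the envelope's vertices
    for i in range(len(x)):
        while len(hull) >= 2:
            o, a = hull[-2], hull[-1]
            # If the middle vertex a lies on or below the chord o--i, drop it.
            if (x[a] - x[o]) * (y[i] - y[o]) - (y[a] - y[o]) * (x[i] - x[o]) >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(x, x[hull], y[hull])

def beeps_value(p_star=0.5, lam=1.0, dt=0.05, r=0.1, n=401, tol=1e-10):
    """Value iteration for the beeps problem on a belief grid."""
    m = 1.0 - np.exp(-lam * dt)        # chance an email arrives in one period
    delta = np.exp(-r * dt)            # principal's discount factor
    nu = np.linspace(0.0, 1.0, n)      # grid of post-message beliefs
    u = (nu <= p_star).astype(float)   # flow payoff: 1 while the agent works
    f = nu + (1.0 - nu) * m            # next-period belief absent disclosure
    V = np.zeros(n)
    while True:
        W = (1.0 - delta) * u + delta * np.interp(f, nu, V)
        V_new = upper_concave_envelope(nu, W)  # cav step = optimal splitting
        if np.max(np.abs(V_new - V)) < tol:
            return nu, V_new
        V = V_new
```

Each iteration computes W (ν) = (1 − δ)u(ν) + δV (f (ν)) on the grid and then sets V = cav W, a discrete analogue of the optimality equation above.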

Consider now the game between a firm and a regulator analyzed in Orlov, Skrzypacz and Zryumov

(2016) (OSZ). The firm sells a product (say, a medical device) that is either safe or unsafe. While

the product is on the market, the firm and the regulator observe noisy information about user

experience, and update their beliefs about the safety of the product. The regulator has the power

to recall the product from the market, at which point learning stops (or the reputation of the

product is sufficiently tarnished that the firm cannot re-introduce it to the market without a major

redesign). From the regulator’s point of view, this is a real options problem. The firm and regulator

incentives are partially aligned. They agree that only a safe product should be left on the market.

They differ in their patience, or relative costs of type-one and type-two errors. Assume that the

optimal stopping belief for the firm would be higher than for the regulator, so that the firm would

like to experiment longer than the regulator. While the regulator has the decision power, the

firm has control over the collection of additional information. Namely, there is a binary random

variable that is correlated with the safety of the product and the firm can perform experiments to

reveal information about that additional signal. The information collection technology is as before:

the firm freely chooses a posterior distribution over the signal, subject to a martingale constraint.

Importantly, whatever the firm learns, it has to disclose, even if it is ex post not in its best interest.

As a result, even though information is valuable, the firm may decide not to learn some information

(at least for a while) to persuade the regulator to wait a bit longer.

²⁵ By now, there are some papers deriving the optimal policy in some specific contexts. Smolin (2015), for instance, considers a related problem, with the twist that the agent’s actions affect transitions.

There are several differences between this model and the two discussed previously. First, the

firm and the regulator have no commitment power. The paper analyzes Markov Perfect equilibria,

with the joint belief about the safety of the drug and the realization of the noisy signal as state

variables. Second, the sender’s preferred action depends on the state. Third, the regulator’s strategy

is not a fixed belief cut-off, but a best-reply to the firm’s information strategy. Finally, beliefs move

stochastically in response to public news, even in the absence of disclosure by the sender.

The resulting equilibrium dynamics depend on the size of the conflict between the firm and

regulator. If the conflict is small (as measured by the difference in autarky threshold beliefs, that

is, the optimal stopping beliefs when only public news are available), then the firm postpones

learning about the additional signal until a belief is reached, at which bad news about the signal

moves the posterior belief about the safety to the firm’s optimal exercise point. At that belief it is

an optimal strategy for the firm to fully reveal the additional information: if the news is bad, it gets

its optimal stopping implemented; if it is good, the regulator postpones a recall as much as possible.

In equilibrium, the regulator has ex post regret for waiting too long if the news is bad. But if the

news is good, the posterior is strictly below his exercise threshold, so waiting is optimal ex post.

In equilibrium, information disclosure is inefficiently late from the regulator’s viewpoint. The firm

manages to make the regulator wait (for some parameters, even pass his autarky threshold) because

the discrete full revelation of the signal is credible and valuable to the regulator. In equilibrium,

some recalls are forced by the regulator (after good news about the signal followed by bad public

news) and some are voluntary (after bad signal realizations), so the firm is able to persuade the

regulator into a form of a compromise.

When the difference in preferences between the two players is large, waiting for such full infor-

mation disclosure is no longer sufficiently valuable to the regulator, and the equilibrium changes

dramatically. When the regulator’s belief reaches the autarky threshold, the firm “pipets” informa-

tion about the signal, leading to either a reflection of beliefs at this threshold, or a discrete jump

to reveal bad news. That is reminiscent of the greedy strategy above: the firm reveals the smallest

amount of information to keep the regulator in the “no-recall” region, an equivalent of the “invest-

ment” region in the previous papers. Such persuasion is profitable for the firm (it manages to keep

the product longer on the market), but somewhat paradoxically has no value for the regulator (his

equilibrium payoff is the same as if no extra signal was available). The intuition comes from the

real-options literature: the regulator’s autarky value function satisfies the smooth-pasting condition at

the exercise threshold. Since the firm’s optimal strategy “pipets” information, it induces beliefs

only on the linear part of the regulator’s value function and that creates no value.

This stylized model can also reflect internal organization problems in which there is a conflict of

interest about the timing of exercise of a real option (as in Grenadier, Malenko, and Malenko (2015) or Guo

(2016)) and one agent has decision rights and another one can credibly disclose additional noisy

information about the state. In some cases, the agent would like to exercise the option sooner (for

example, if it is a decision when to launch a product). OSZ show that the agent then immediately

reveals all information about the noisy signal and can be hurt by having access to the information.

If the receiver expects full immediate information disclosure, it is rational for him to wait. Hence,

an impatient sender is better off revealing the information immediately.

6 Richer Worlds

A significant shortcoming of all papers reviewed so far is the restriction to two independent arms.

This is probably not the best way to think of the process of discovery and innovation. There are

not simply two ways to go, but many, and trying out a new technique close to an existing one is


Figure 14: The mapping W (p), a two-sided Wiener process with drift of slope µ, with known value W (p∗0 ) at p∗0 .

likely to yield similar results. In contrast, trying out a very different technique is risky, but may pay

big. Several models attempt to capture such environments. These include Akcigit and Liu (2016),

Bonatti and Rantakari (2016), Garfagnini and Strulovici (2016), Callander and Matouschek (2015).

Here, we showcase Callander (2011).

Callander introduces the following model of learning by trial-and-error. There is a continuum

of products that can be tried, p ∈ R. Each product is mapped into some outcome, W (p) ∈ R, and

entails a cost W (p)2 .

At the start, the value of one and only one product is known, p∗0 , with value W (p∗0 ) > 0.

The player’s belief about W (·) is modeled as a two-sided Wiener process. That is, we have two

independent one-dimensional Wiener processes W− = (W− (p) : p ≥ 0) and W+ = (W+ (p) : p ≥ 0),

with parameters (−µ, σ) and (µ, σ) respectively, and initial value W− (0) = W+ (0) = W (p∗0 ), and

we define (

W+ (p − p∗0 ) if p ≥ p∗0 ,

W (p) =

W− (p∗0 − p) if p < p∗0 .

It is assumed that µ < 0. See Figure 14. Trying out a given p reveals the corresponding value of

W (p).

What values of p should be sampled? This is a difficult problem, already interesting when the

agent is myopic (a sequence of entrepreneurs, each with one chance only), the focus of Callander’s

analysis. In a given round n, after history hn , the entrepreneur chooses p so as to minimize

E[W (p) | hn ]² + Var[W (p) | hn ].²⁶ Let p̄ and p̲ denote the highest and lowest values of pm for m < n along that history. Because µ ≤ 0, for any p < p̲,

E[W (p) | hn ]² + Var[W (p) | hn ] = (W (p̲) − µ(p̲ − p))² + σ² (p̲ − p) > W (p̲)²

whenever W (p̲) > 0. Hence, by induction, no value of p < p∗0 is ever chosen. Note that the same argument implies that

if W (p̄) ≤ 0, then choosing a product p > p̄ is strictly worse than choosing p̄. Hence, if a product

p is ever chosen so that p = p̄ and W (p) ≤ 0, no product to the right of p̄ will ever be sampled.

Next, notice that if pm < pm′ are consecutive products (that is, no product in between their values has been chosen along that history) and sgn W (pm ) = sgn W (pm′ ), then

p ∈ (pm , pm′ ) ⇒ E[W (p) | hn ]2 + Var[W (p) | hn ] > min{W (pm )2 , W (pm′ )2 }.

²⁶ This is because E[X²] = E[X]² + Var[X].

Hence also, if pm < pm′ are tried products with sgn W (pm ) = sgn W (pm′ ), and all products p ∈ (pm , pm′ ) that were ever chosen are such that sgn W (p) = sgn W (pm ), then, if a product is ever chosen again in the interval (pm , pm′ ), it must be a known product, namely one that minimizes the cost among the known products.

This leaves us with two cases: (i) trying a product p > p̄, and (ii) trying a product in an interval

(pm , pm′ ) of consecutive products such that sgn W (pm ) ≠ sgn W (pm′ ). Note that in that case it

must be that W (pm ) > 0 > W (pm′ ) (if one is 0 it is clearly optimal to choose it forever). This is

because, by our previous observations, products to the right of a product with sgn W (p) = −1 are

never chosen. Hence, the only configuration that can arise in which there are tried products with

both negative and positive values must be such that all products with negative values lie to the

right of the products with positive value.

Choosing a product p > p̄: Consider first trying a product at the “frontier,” that is, larger

than any product tried so far. As discussed, we may assume W (p̄) > 0, otherwise such a choice

cannot be optimal. We must solve

min_{p≥p̄} { E[W (p) | hn ]² + Var[W (p) | hn ] } = min_{p≥p̄} { (W (p̄) + µ(p − p̄))² + (p − p̄)σ² }.

The first-order condition is

µ(p − p̄) = −σ²/(2µ) − W (p̄),

and the implied p − p̄ is positive if and only if

W (p̄) > −σ²/(2µ) =: α.

(Recall that µ < 0.) Hence, such a choice is only a candidate for optimality if W (p̄) > α; if it is

chosen, it is set so that

E[W (p)] = W (p̄) + µ(p − p̄) = α.

In particular, this must be the optimal choice p∗1 in the first period n = 1 if W (p∗0 ) > α. (Interest-

ingly, p∗1 is single-peaked in µ.) See left panel in Figure 15. Otherwise, pn = p∗0 for all n ≥ 1, as the

first outcome is “good enough.”

Note that the cost of choosing this candidate is (plugging in the first-order condition into the

objective)

2αW (p̄) − α2 .

Hence the relevant comparison is whether this is larger than the cost of the best alternative so far.

In particular, this choice is dominated by the product p̂ = arg min{W (pm )² : m = 0, . . . , n − 1} if and only if

W (p̄) > (W̄ ² + α²)/(2α),

where W̄ = W (p̂).
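The frontier formulas can be checked against a brute-force minimization (a sketch; the parameter values are arbitrary, and the helper name is mine):

```python
def frontier_candidate(W_bar, mu, sigma2):
    """Myopically optimal step to the right of the frontier (requires mu < 0)."""
    alpha = -sigma2 / (2 * mu)             # threshold alpha = -sigma^2 / (2 mu)
    step = (alpha - W_bar) / mu            # from the FOC: mu * step = alpha - W_bar
    cost = 2 * alpha * W_bar - alpha ** 2  # objective value at the candidate
    return alpha, step, cost

alpha, step, cost = frontier_candidate(W_bar=2.0, mu=-1.0, sigma2=2.0)
# Brute-force check of min over x >= 0 of (W_bar + mu x)^2 + sigma2 x.
grid = [i * 1e-4 for i in range(50000)]
brute = min((2.0 - x) ** 2 + 2.0 * x for x in grid)
assert abs(alpha - 1.0) < 1e-12 and abs(step - 1.0) < 1e-12
assert abs(cost - brute) < 1e-6
```

At the candidate, the expected value W_bar + mu * step equals α, matching the first-order condition in the text.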

Choosing a product p ∈ (pm , pm′ ), where sgn W (pm ) ≠ sgn W (pm′ ): Recall that we may assume that no product in the interval (pm , pm′ ) has ever been tried so far, and that W (pm ) > 0 > W (pm′ ). This means that we can assume that

(W (pm′ ) − W (pm ))/(pm′ − pm ) < µ.

26


Figure 15: Exploration phase.

That is, the line connecting the points (pm , W (pm )) and (pm′ , W (pm′ )) is steeper than the drift

line of the Wiener process. This is because of our earlier analysis: when pm′ was chosen, it was

picked either as the rightmost product so far, in which case the resulting negative value (relative to a positive expected value of α) means that the slope is indeed steeper; or it was not the rightmost product, but then it was to the left of the first product that resulted in a negative value, meaning

that the slope is even steeper.

We must minimize

( W (pm ) + ((p − pm )/(pm′ − pm )) (W (pm′ ) − W (pm )) )² + σ² (p − pm )(pm′ − p)/(pm′ − pm ),

which gives as candidate the solution to

E[W (p) | hn ] = µ(pm , pm′ ) (1 − 2 (p − pm )/(pm′ − pm )), where µ(pm , pm′ ) := −σ² / ( 2 (W (pm′ ) − W (pm ))/(pm′ − pm ) ).

Of course, one must additionally compare the resulting cost to the cost W̄ ² of the best product so far. The specific conditions are tedious, but with probability one, this process (of

“triangulation”) stops. See Figure 16.
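The triangulation candidate can likewise be checked against a brute-force minimization of the bridge objective (a sketch; the endpoints and values are arbitrary, and the helper name is mine):

```python
def triangulation_candidate(pm, pm2, Wm, Wm2, sigma2):
    """Myopic candidate inside (pm, pm2), where W(pm) > 0 > W(pm2)."""
    s = (Wm2 - Wm) / (pm2 - pm)   # slope of the chord between the endpoints
    mu_loc = -sigma2 / (2 * s)    # local drift term mu(pm, pm2)
    # Solve Wm + s (p - pm) = mu_loc * (1 - 2 (p - pm)/(pm2 - pm)) for p.
    x = (mu_loc - Wm) / (s + 2 * mu_loc / (pm2 - pm))
    return pm + x

p = triangulation_candidate(pm=0.0, pm2=1.0, Wm=1.0, Wm2=-1.0, sigma2=1.0)

# Brute force: bridge mean Wm + s q and variance sigma2 q (1 - q) on (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
brute_p = min(grid, key=lambda q: (1.0 - 2.0 * q) ** 2 + q * (1.0 - q))
assert abs(p - 0.5) < 1e-9
assert abs(brute_p - p) < 1e-3
```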

To summarize, the game always starts with an exploratory phase, in which new products (to

the right of the existing ones) are chosen. This process might stop, with the agent settling on the

one with the best performance (which need not be the one furthest to the right). If at some point

in the exploratory phase, a negative outcome results, the policy shifts to a triangulation phase,

picking either a new product within the unique interval of products over which the realized values

change sign, or the best product so far. Eventually, search stops.

It would be interesting to see how these dynamics change when entrepreneurs are forward-

looking. Solving Callander’s model for patient agents is an important open problem in this area.



Figure 16: Triangulation phase, first and second step.

References

[1] Akcigit, U., and Q. Liu (2016). “The Role of Information in Innovation and Competition,”

Journal of the European Economic Association, 14, 828–870.

[2] Aumann, R.J. and S. Hart (2003). “Long cheap talk,” Econometrica, 71, 1619–1660.

[3] Aumann, R.J., and M. Maschler (1995). Repeated Games with Incomplete Information, MIT

Press.

[4] Bimpikis, K., S. Ehsani, and M. Mostagir (2016). “Designing Dynamic Contests,” working

paper.

[5] Bolton, P., and C. Harris (1999). “Strategic Experimentation,” Econometrica, 67, 349–374.

[6] Bonatti, A., and J. Hörner (2011). “Collaborating,” American Economic Review, 101, 632–663.

[7] Bonatti, A., and J. Hörner (2015). “Learning to Disagree in a Game of Experimentation,”

working paper, Cowles Foundation for Research in Economics.

[8] Bonatti, A., and H. Rantakari (2016). “The Politics of Compromise,” American Economic

Review, 106, 229–259.

[9] Callander, S. (2011). “Searching and Learning by Trial and Error,” American Economic Review,

101, 2277–2308.

[10] Callander, S., and N. Matouschek (2015). “Managing on Rugged Landscapes,” working paper,

Stanford GSB.

[11] Campbell, A., F. Ederer, and J. Spinnewijn (2014). “Delay and Deadlines: Freeriding and

Information Revelation in Partnerships,” American Economic Journal: Microeconomics, 6,

163–204.

[12] Che, Y-K. and J. Hörner (2015). “Optimal Design for Social Learning,” working paper, Cowles

Foundation for Research in Economics.


[13] Cohen, A., and E. Solan (2013). “Bandit Problems with Lévy Processes,” Mathematics of

Operations Research, 38, 92–107.

[14] Crawford, V., and J. Sobel (1982). “Strategic Information Transmission,” Econometrica, 50, 1431–1451.

[15] Cripps, M., and C. Thomas (2015). “Strategic Experimentation in Queues,” working paper,

UCL.

[16] Das, K. (2015). “The Role of Heterogeneity in a Model of Strategic Experimentation,” working

paper, University of Exeter.

[18] Frick, M., and Y. Ishii (2015). “Innovation Adoption by Forward-Looking Social Learners,”

working paper, Harvard.

[19] Garfagnini, U., and B. Strulovici (2016). “Social Experimentation with Interdependent and

Expanding Technologies,” Review of Economic Studies, forthcoming.

[20] Gordon, S., C. Marlats, and L. Ménager (2015). “Observation Delays in Teams,” working

paper, Paris.

[21] Gordon, S., and E.B. de Villemeur (2010). “Strategic Advice on a Timing Decision,” draft,

Montréal.

[22] Grenadier, S., A. Malenko, and N. Malenko (2015). “Timing Decisions in Organizations:

Communication and Authority in a Dynamic Environment,” working paper, MIT.

[23] Guo, Y. (2016). “Dynamic Delegation of Experimentation,” American Economic Review, 106,

1969–2008.

[24] He, J. (2015). “Competition in Social Learning,” working paper, Stanford University.

[25] Heidhues, P., S. Rady, and P. Strack (2015). “Strategic experimentation with private payoffs,”

Journal of Economic Theory, 159, 531–551.

[26] Holmström, B. (1984). “On the Theory of Delegation,” in Bayesian Models in Economic

Theory, eds. Marcel Boyer and Richard Kihlstrom, North-Holland Publishing Co., Amsterdam.

[27] Hörner, J., N. Klein, and S. Rady (2015). “Strongly Symmetric Equilibria in Bandit Games,”

working paper, Yale University.

[28] Kamenica, E., and M. Gentzkow (2011). “Bayesian Persuasion,” American Economic Review,

101, 2590–2615.

[29] Keller, G., and S. Rady (2010). “Strategic Experimentation with Poisson Bandits,” Theoretical

Economics, 5, 275–311.

[30] Keller, G., and S. Rady (2015). “Breakdowns,” Theoretical Economics, 10, 175–202.

[31] Keller, G., S. Rady, and M. Cripps (2005). “Strategic Experimentation with Exponential

Bandits,” Econometrica, 73, 39–68.


[32] Khromenkova, D. (2015). “Collective Experimentation with Breakdowns and Breakthroughs,”

working paper, Mannheim University.

[33] Klein, N., and S. Rady (2011). “Negatively Correlated Bandits,” Review of Economic Studies,

78, 693–792.

[34] Kremer, I., Y. Mansour, and M. Perry (2014). “Implementing the “Wisdom of the Crowd”,”

Journal of Political Economy, 122, 988–1012.

[35] Melumad, N., and T. Shibano (1991). “Communication in settings with no transfers,” RAND

Journal of Economics, 22, 173–198.

[36] Murto, P., and J. Välimäki (2011). “Learning and Information Aggregation in an Exit Game,”

The Review of Economic Studies, 78, 1426–1461.

[37] Murto, P., and J. Välimäki (2013). “Delay and information aggregation in stopping games

with private information,” Journal of Economic Theory, 148, 2404–2435.

[38] Orlov, D., A. Skrzypacz and P. Zryumov (2016). “Persuading the Regulator To Wait,” working

paper, Stanford GSB.

[39] Rayo, L., and I. Segal (2010). “Optimal Information Disclosure,” Journal of Political Economy,

118, 949–987.

[40] Renault, J., E. Solan, and N. Vieille (2015). “Optimal Dynamic Information Provision,” work-

ing paper, arXiv:1407.5649.

[41] Rosenberg, D., E. Solan, and N. Vieille (2007). “Social Learning in One-Armed Bandit Prob-

lems,” Econometrica, 75, 1591–1611.

[42] Rosenberg, D., A. Salomon, and N. Vieille (2013). “On Games of Strategic Experimentation,”

Games and Economic Behavior, 82, 31–51.

[43] Smolin, A. (2015). “Optimal Feedback Design,” working paper, Yale University.

[44] Strulovici, B. (2010). “Learning While Voting: Determinants of Collective Experimentation,” Econometrica, 78, 933–971.

[45] Thomas, C. (2015). “Strategic Experimentation with Congestion,” working paper, The Uni-

versity of Texas at Austin.
