
Empirical industrial organization - the basics

Øyvind Thomassen∗
November 2, 2019


oyvind.thomassen@gmail.com, https://sites.google.com/site/oyvindthomassen, Seoul National University, Department of Economics.

Contents
1 Structural models
1.1 Reasons to use a structural model
1.2 Counterfactuals
1.3 More on the merged firm’s pricing
1.4 Practical issues in structural modelling

2 Basic econometrics
2.1 Causal effect
2.2 Structural equation
2.3 Randomized experiment
2.4 Instrumental variable
2.5 OLS and IV are both method-of-moments estimators

3 Large-sample theory
3.1 Convergence in probability and law of large numbers
3.2 Convergence in distribution
3.3 Intuition for central limit theorem

4 Generalized method of moments
4.1 Properties of the GMM estimator
4.2 Consistency, asymptotic normality, estimated asymptotic variance
4.3 OLS as a GMM estimator
4.4 Two-stage least squares as a GMM estimator
4.5 Panel data
4.6 Multiple equations
4.7 Some alternatives for the weighting matrix
4.8 Sketch of a proof for asymptotic normality
4.9 The role of the rank condition

5 Discrete-choice demand
5.1 Utility maximization subject to a budget constraint
5.2 The outside good and normalizations
5.3 The role of εij and common choices for its distribution
5.4 The role of ξj in fitting the model to the data
5.5 Estimation
5.6 Finding ξj
5.7 Random coefficients
5.8 Practical issues in estimation
5.9 The firm’s pricing problem

6 Berry, Levinsohn and Pakes (1995) [BLP]
6.1 Brief overview
6.2 Data
6.3 Demand model
6.4 Supply model
6.5 Moments
6.6 Discussion of GMM requirements
6.7 Practical issues

7 Nevo (2001)
7.1 Brief overview
7.2 Data
7.3 Model
7.4 Estimation
7.5 Instruments
7.6 Using implied markups to determine conduct

1 Structural models
• In the standard framework of econometrics there is a function f that relates explana-
tory variables x, parameters θ and unobservables e to a dependent variable y by a
claim that at the true parameter value θ0 ,

y = f (x, θ0 , e).

• The linear regression model where f (x, θ, e) = xθ + e is one example.

• In a structural model, the function f explicitly models an economic agent’s (a consumer’s or a firm’s) utility maximization or profit maximization, given the agent’s preferences and choice set, the market structure, the equilibrium solution concept and other aspects of the economic structure that determine their choice.

• By contrast, a researcher who uses a non-structural (sometimes called ‘reduced form’) econometric model may be agnostic about the underlying economic structure of the problem and simply assume that a linear model of the form y = xθ + e gives a good approximation to how y tends to change in response to a change in x (or a specific element of the vector x).

• Consider the challenge of estimating the demand for J differentiated products (where
each consumer chooses only one of the alternatives).

• A structural model may assume that a consumer i chooses the alternative j that
maximizes an indirect utility function (conditional on j)

uij = xj β − αpj + eij , (1.1)

where xj is a vector of product characteristics and pj is price.

• That is (when x and ei gather (xj, pj) and eij, respectively, for all j),

fj(x, θ, ei) = 1[consumer i chooses j | x, θ, ei]
            = 1[ j = arg max_{1≤k≤J} {xkβ − αpk + eik} ].    (1.2)

• If dij = 1[consumer i is observed to choose j], we can match the predictions of the
model (1.2) to dij to estimate the parameters α and β.

• Alternatively, a non-structural approach would be to estimate a linear model with J equations, each with J parameters (where αjk is the elasticity of demand for j w.r.t. the price of k):

ln qj = Σ_{k=1}^{J} αjk ln pk,   j = 1, . . . , J.    (1.3)

1.1 Reasons to use a structural model

• Most empirical research in economics attempts to answer a question of the type ‘how
would y change if x changed?’.

• For some such questions, for instance

y = test scores
x = number of students per teacher,

it may be feasible to conduct a so-called randomized experiment, where the number of students per teacher is varied exogenously (in a way that is not related to other determinants of test scores).

• A linear regression of y on x may then give a correct answer to the question.

• There might also be naturally occurring exogenous variation in x that could be used
to estimate this effect even without randomized experiments.

• In other situations, such as

y = market price of various beer brands
x = ownership structure, i.e. which brands are owned by which firms,

it is not obvious how to conduct a randomized experiment, since it is neither desirable nor feasible to force brewers to merge.

• One could imagine using results from other mergers, either in the same or in other
industries.

• But usually so many things vary across settings that it is unclear whether effects found in one setting really imply anything at all about another setting.

• A fundamental problem here is that the experiment or counterfactual we would like to carry out is at such a high level of aggregation that there is really only one observation (or very few) that could be used in the regression of y on x.

• By contrast, the class size example can use observations for a large number of different
classes that can reasonably be assumed independent.

• The structural approach to the merger-effect problem is to dig into the economic structure that underlies the effect of x on y: individual demand functions and firms’ profit-maximizing behaviour.
• By using observations of individual beer demand, we can find out how market demand responds to price changes, i.e. the derivatives of the demand for each beer brand j with respect to its own price, ∂Qj/∂pj, as well as with respect to the prices of competing products, ∂Qj/∂pk.

• These derivatives (or elasticities) could be estimated with a model like (1.3), but it is
likely that when the market structure, and therefore prices, change, the price elasticities
also change.
• On the other hand, it seems more plausible that consumer preferences remain stable,
and therefore that a demand model like (1.2) will be applicable also when the ownership
structure changes.

1.2 Counterfactuals

• One of the great advantages of using a structural model is the ability to answer counterfactual questions, like the effect of a merger or a tax reform that has not taken place.
• Once a fully specified structural model has been estimated, provided we believe the
assumptions (such as functional forms etc.), we know everything we need to know
about the market, much like in a numerical example in a microeconomics textbook,
and we can find new equilibrium prices.
• Often solving for new equilibrium prices after a hypothetical change in the economic
structure (merger, tax) is called simply a counterfactual.
• This can be illustrated by looking at the merger example in some more detail.
• Suppose for simplicity that there are two products j = 1, 2 and that initially the
products are owned by two separate single-product firms, named 1 and 2 after their
products.
• For each j, the first-order condition for maximizing profit (pj − mcj)Qj(pj, pk) (assuming constant marginal cost mcj and that prices are a Nash equilibrium) gives an expression for the markup, the so-called Lerner index, which holds for j = 1, 2:

(pj − mcj) ∂Qj/∂pj + Qj = 0    (1.4)
(pj − mcj)/pj = 1 / [−(∂Qj/∂pj)(pj/Qj)],    (1.5)

which says that the more responsive demand is to price, the smaller the markup will
be.
• After estimating a demand function like (1.2) we know ∂Qj/∂pj for both j, and pj and Qj are observed.
are observed.
• We can then solve (1.5) for mcj for j = 1, 2.
• We want to know how the markup changes if the two firms merge.
• The merged firm solves the problem

max_{p1,p2} [(p1 − mc1)Q1(p1, p2) + (p2 − mc2)Q2(p2, p1)],

which has first-order conditions (for each j = 1, 2 and k ≠ j):

(pj − mcj) ∂Qj/∂pj + Qj + (pk − mck) ∂Qk/∂pj = 0.    (1.6)

• Since mcj for j = 1, 2 are now known (recovered from (1.5)), this is a system of two equations (nonlinear, since prices enter Q) in the two unknowns p1 and p2, which can usually be solved numerically.
• The solutions, the new equilibrium prices, then answer the question of how prices will
react to the merger.

1.3 More on the merged firm’s pricing

• We can rewrite the first-order conditions to gain some economic insight:

(pj − mcj)/pj = 1 / [−(∂Qj/∂pj)(pj/Qj)] + [(pk − mck) ∂Qk/∂pj] / [pj(−∂Qj/∂pj)].

• In the beer example, we expect the two products to be substitutes, so ∂Qk/∂pj > 0, which makes the last term positive: the markup for each j = 1, 2 will be higher for the merged firm.
• There are two factors that determine the difference between the single-firm markup
and the two-firm markup:
– The magnitude of the diversion ratio (∂Qk/∂pj)/(−∂Qj/∂pj), which tells us how many units of demand for k are gained per unit of demand for j lost when pj is raised. The more consumers who simply switch to k when pj goes up, the more profitable it is for the merged firm to raise pj.

– The margin pk − mck relative to the price pj: the more profitable each unit of k is, the more worthwhile it is to raise pj, since consumers who switch to k will earn the merged firm a higher unit profit.

1.4 Practical issues in structural modelling

• As opposed to linear models like (1.3), structural models usually require the solution of some kind of optimization problem just to calculate the value of the model’s prediction (f) at each value of the parameters, like the max operator in (1.2).

• Sometimes this optimization problem can be computationally burdensome to solve, for instance if the economic agent faces a dynamic problem.

• A second issue is that, because the model (f) is a nonlinear function of the parameters, the estimates cannot be calculated simply using matrix operations on the data.

• Instead they must be found by some iterative optimization algorithm (numerically minimizing a GMM objective function or the negative of a likelihood function).

• We can sum up the different aspects of structural modelling as:

1. Coming up with a model f that makes economic sense and can be estimated in
practice. Writing a computer program to calculate the value of f for each value
of the parameters.
2. Coming up with sensible restrictions (moment conditions or error distributions)
to form an econometric objective function (GMM or likelihood function) and an
expression for standard errors.
3. Using an algorithm for numerical minimization to minimize the objective function.
4. Using the estimates to answer the economic questions of interest, e.g. running a counterfactual.

2 Basic econometrics
• For any two random variables y and x, there are two common ways of predicting y based on x:

• Conditional expectation: E(y|x)

• Linear projection: LP(y|x) = β0 + β1x, where

β1 = Cov(x, y)/Var(x),   β0 = E(y) − β1E(x).    (2.1)

or, equivalently

Cov(x, u) = 0, E(u) = 0, (2.2)

where u is just shorthand for y − β0 − β1 x = y − LP(y|x).

• OLS regression of y on x gives a good estimate of the coefficients in the linear projection.

• E(y|x) minimizes mean squared prediction error, i.e. solves

min_h E[(y − h(x))²].    (2.3)

• LP(y|x) minimizes (2.3) in the class of all functions h that are linear in x.

• Therefore, if E(y|x) is linear, E(y|x) = LP(y|x).

2.1 Causal effect

• Causal effect = change in y when x changes, holding everything else fixed.

• Consider a population of new car models, where y is units sold and x is price.

• Suppose we observe that sales are higher when price is higher: Cov(x, y) > 0.

• Is β1 the causal effect of x on y? (Suppose for simplicity E(y|x) = LP(y|x).)

• Something else is probably going on: cars with higher price have more space, more
powerful engine, leather seats, etc.

• This – not the increase in price – is the reason Cov(x, y) > 0.

• Then β1 is not the causal (all else held equal) effect.

2.2 Structural equation

• A structural equation is a model where the parameters have a causal interpretation.

• As the term is used at the beginning of this document, a structural model is always a structural equation, while a structural equation need not be a structural model, i.e. it does not have to model explicitly the optimizing behaviour of economic agents.

• Continuing with the car example, let the structural equation be

y = δ0 + δ1 x + ω. (2.4)

• The error term ω contains all other factors (than x) that affect y (space, engine power
etc.).

• Price tends to be higher when ω is high: Cov(x, ω) > 0.

• The linear projection can be written on the same linear form as the structural equation:

y = β0 + β1 x + u, (2.5)

where u = y − LP(y|x).

• But it is only if Cov(x, ω) = 0 and E(ω) = 0 that δ1 = β1 and δ0 = β0 , and that OLS
can be used to obtain the causal effect.

• β1 does not take into account the fact that some of the change in y as x increases
comes from ω, not from x itself.

• To see this:¹

β1 = δ1 + Cov(x, ω)/Var(x).    (2.6)

• Since Cov(x, ω) > 0 in this example, β1 > δ1 .


¹ Proof: using (2.4),

Cov(x, y) = Cov(x, δ0 + δ1x + ω) = δ1Var(x) + Cov(x, ω),

and dividing through by Var(x) gives Cov(x, y)/Var(x) = δ1 + Cov(x, ω)/Var(x).
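• The decomposition (2.6) is easy to verify on simulated data. The sketch below uses invented numbers (δ0 = 1, δ1 = 3, and a price x that loads positively on the unobservable ω):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Structural equation y = d0 + d1*x + w, with Cov(x, w) > 0 built in.
# All parameter values are invented for illustration.
d0, d1 = 1.0, 3.0
w = rng.normal(size=n)                   # omitted quality (space, engine power, ...)
x = 2.0 + 0.5 * w + rng.normal(size=n)   # price responds to quality: Cov(x, w) = 0.5
y = d0 + d1 * x + w

# Linear-projection slope beta1 = Cov(x, y)/Var(x), from the sample moments
C = np.cov(x, y)                         # C[0, 0] = Var(x), C[0, 1] = Cov(x, y)
beta1 = C[0, 1] / C[0, 0]

# Decomposition (2.6): beta1 = delta1 + Cov(x, w)/Var(x)
bias = np.cov(x, w)[0, 1] / C[0, 0]
print(beta1, d1 + bias)  # the two sides of (2.6) agree
```

In this design the bias term is about 0.5/1.25 = 0.4, so the projection slope is roughly 3.4 even though the causal effect is 3.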

• Note that the structural equation is not simply a feature of the joint distribution of y
and x.

• Suppose for concreteness that β1 = 3.

• This β1 could be generated by δ1 = 3 and Cov(x, ω)/Var(x) = 0, or by δ1 = −10 and Cov(x, ω)/Var(x) = 13.

• The joint distribution of y and x does not imply anything about the structural model:

structural equation & Cov(x, ω) ⇒ joint distribution of y and x
joint distribution of y and x ⇏ structural equation.

• Only with additional assumptions can we draw conclusions about the structural equa-
tion from the joint distribution of y and x.

• The most standard such assumption is that Cov(x, ω) = 0, also called exogeneity of x.

• If x is exogenous, β1 = δ1 , and OLS gives a good estimate of the causal effect δ1 .

2.3 Randomized experiment

• What if x is not exogenous in the population?

• Use a computer program to generate an arbitrary price for each car model.

• We have now created a new population (or sample) where Cov(x, ω) = 0 by design.

• Next we offer the cars for sale at these prices, and obtain a value of y corresponding
to each value of x.

• Now the linear projection corresponds to the structural equation, since x is exogenous,
and OLS works.

2.4 Instrumental variable

• As an alternative, it is sometimes possible to find a third variable z such that

Cov(z, x) ≠ 0,   Cov(z, ω) = 0.    (2.7)

• Suppose the price faced by consumers includes a sales tax, and that the tax is somewhat
arbitrary – not correlated with ω.

• Then the amount paid in tax, z, satisfies (2.7).

• A natural experiment (an event that creates exogenous variation in the endogenous variable x) can serve the same purpose.

• We can now use (2.7) and (2.4) to find the causal effect δ1:

Cov(z, y) = Cov(z, δ0 + δ1x + ω)
          = δ1Cov(z, x) + Cov(z, ω)
          = δ1Cov(z, x),

so that

δ1 = Cov(z, y)/Cov(z, x).

• With a random sample (yi, xi, zi) we can use the sample versions of the covariances in the last line to get a good estimator of δ1 (called the IV estimator).
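• A small simulation illustrates the point (the numbers are invented: δ1 = −2, an instrument z that shifts x, and an unobservable ω correlated with x). The sample covariance ratio recovers δ1, while the OLS slope does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Structural model y = d0 + d1*x + w, with endogenous x (Cov(x, w) > 0) and an
# instrument z satisfying (2.7): Cov(z, x) != 0, Cov(z, w) = 0.
# (Think of z as an arbitrary tax component of price.)  Numbers are illustrative.
d0, d1 = 1.0, -2.0
w = rng.normal(size=n)
z = rng.normal(size=n)
x = 1.0 + 0.8 * z + 0.5 * w + rng.normal(size=n)
y = d0 + d1 * x + w

ols = np.cov(x, y)[0, 1] / np.cov(x, x)[0, 1]   # converges to d1 + Cov(x,w)/Var(x)
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]    # converges to d1

print(ols, iv)  # OLS is shifted upward by the bias term; IV is close to -2
```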

2.5 OLS and IV are both method-of-moments estimators

• The requirements for the linear projection can be rewritten as E(u) = 0 and E(xu) =
Cov(x, u) + E(x)E(u) = Cov(x, u) = 0.

• The OLS estimator (β̂0, β̂1) and the IV estimator (γ̂0, γ̂1) both solve the sample versions of the respective exogeneity conditions:

Σ_{i=1}^n xi(yi − β̂0 − β̂1xi) = 0,   Σ_{i=1}^n (yi − β̂0 − β̂1xi) = 0
Σ_{i=1}^n zi(yi − γ̂0 − γ̂1xi) = 0,   Σ_{i=1}^n (yi − γ̂0 − γ̂1xi) = 0.

3 Large-sample theory
• Let {xi} = x1, x2, . . . be a sequence of L × 1 random vectors that are independent and identically distributed (i.i.d.) with mean E(xi) = µ and (L × L) covariance matrix Var(xi) = Σ.

• Then the sample mean x̄n = (1/n) Σ_{i=1}^n xi has mean and variance

E(x̄n) = (1/n) nE(xi) = µ
Var(x̄n) = (1/n²) nVar(xi) = Σ/n.

3.1 Convergence in probability and law of large numbers

• As n goes to infinity, Var(x̄n ) = Σ/n approaches zero.

• Then x̄n (not just its expectation) must approach µ.

• Concretely, the probability that the distance between x̄n and µ is greater than any given small number tends to zero as n goes to infinity. This is called convergence in probability, and written as

x̄n →ᵖ µ.

• The fact that the sample mean across an i.i.d. sample converges in probability to the
population mean is called the Law of Large Numbers.

3.2 Convergence in distribution



• If instead we consider √n times the sample mean, the variance no longer approaches zero, and the randomness remains:

E(√n x̄n) = √n µ,   Var(√n x̄n) = (√n)² Var(x̄n) = Σ.

• Given that the mean and variance of √n(x̄n − µ) are unchanged as n grows, it is natural to ask what its distribution is.

• The (Lindeberg-Lévy) Central Limit Theorem says that it converges to a random vector with a normal distribution as n gets large:

√n(x̄n − µ) →ᵈ N(0, Σ).

3.3 Intuition for central limit theorem

• No assumption has been made about the distribution of xi .

• It is therefore striking that its limiting distribution is necessarily normal.

• To get a sense of why (without any formal proof), consider a scalar (L = 1) example,
where xi is the number shown on a die when thrown the i-th time.

• Clearly xi has a (discrete) uniform distribution where the probability of each outcome is 1/6.

• How does x̄n for such uniformly distributed random variables get a bell-shaped distri-
bution as n increases?

• Consider the support of x̄n for n = 2.


• Its extremes are 1 and 6. Each of these outcomes occurs only with probability 1/6² = 1/36, since only one sequence of outcomes, {1, 1} and {6, 6} respectively, gets us there.

• On the other hand, a point in the middle of the support, 7/2, can come about in six different ways: as {1, 6}, {2, 5}, {3, 4}, {4, 3}, {5, 2} or {6, 1}.

• Its probability is therefore 6/6² = 1/6.

• Continuing to an arbitrary n, the extreme outcomes 1 and 6 have the low probability 1/6ⁿ, while the probability of outcomes near 3.5 is much higher.

• As n increases, x̄n takes on values only between 1 and 6, and gets increasingly concen-
trated at 3.5 as the variance approaches zero.

• By contrast, √n[x̄n − E(xi)] has a support that approaches (−∞, ∞) as n increases.

• Figure 1 shows histograms representing the distributions of x̄n (left column) and √n[x̄n − E(xi)] (right column) for rolling a die n times, with n = 1, 2, 3, 10, 100, 10000.

• If we roll a die n times, the resulting x̄n is of course a number.

• To see what the distribution of x̄n is for a given value of n, we must generate a large
number, M , of samples, each of size n and calculate x̄n for each of the M samples.

• The left column serves to illustrate the Law of Large Numbers, as x̄n concentrates
around 3.5 as n increases.


• The right column illustrates the Central Limit Theorem, as the distribution of √n[x̄n − E(xi)] approaches the bell shape (with centre zero) characteristic of the normal distribution.

Figure 1: Distribution of 20,000 draws of x̄n = (1/n) Σ_{i=1}^n xi (left column) and √n[x̄n − E(xi)] (right column), for n = 1, 2, 3, 10, 100, 10000, where xi has a discrete uniform distribution with support {1, . . . , 6}.
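• The simulation behind the figure is easy to reproduce. The sketch below generates M = 20,000 samples of n rolls and checks that Var(x̄n) shrinks like σ²/n while Var(√n(x̄n − µ)) stays at Var(xi) = 35/12:

```python
import numpy as np

rng = np.random.default_rng(2)
M = 20_000                    # number of simulated samples, as in Figure 1
mu, sigma2 = 3.5, 35 / 12     # mean and variance of a single die roll

results = {}
for n in (1, 10, 1000):
    rolls = rng.integers(1, 7, size=(M, n))   # M samples of n rolls each
    xbar = rolls.mean(axis=1)                 # M draws of the sample mean
    root_n_dev = np.sqrt(n) * (xbar - mu)     # M draws of sqrt(n)*(xbar - mu)
    results[n] = (xbar.var(), root_n_dev.var())
    # LLN: xbar.var() ~ sigma2/n shrinks to zero.
    # CLT: root_n_dev keeps variance sigma2 and its histogram turns bell-shaped.
    print(n, results[n])
```

Plotting histograms of `xbar` and `root_n_dev` for each n reproduces the two columns of the figure.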

4 Generalized method of moments
• Let w1, . . . , wn be a sample from a sequence of data vectors {wi}, and g(wi, θ) an L × 1 vector that depends on wi and a parameter vector θ.

• A function θ̂ of the sample defined as

θ̂ = arg min_θ [Σ_{i=1}^n g(wi, θ)]′ W [Σ_{i=1}^n g(wi, θ)],    (4.1)

where W is an L × L positive definite matrix, is called a GMM estimator.

4.1 Properties of the GMM estimator

• Suppose

1. The sequence of data vectors {wi} is i.i.d.²
2. The parameter vector θ is K × 1, with K ≤ L.
3. E[g(wi, θ0)] = 0.
4. The L × K matrix G = E[∂g(wi, θ0)/∂θ′] has rank K.

• Then there exists a θ̂ that satisfies (4.1),³ and for which the following holds:

4.2 Consistency, asymptotic normality, estimated asymptotic variance


• θ̂ →ᵖ θ0 as n → ∞.

• √n(θ̂ − θ0) →ᵈ N[0, A⁻¹BA⁻¹] as n → ∞, where

A = G′WG,   B = G′WVWG,   V = E[g(wi, θ0)g(wi, θ0)′].

• Âvar(θ̂) = Â⁻¹B̂Â⁻¹/n, where

Â = Ĝ′WĜ,   B̂ = Ĝ′WV̂WĜ,

² The result also holds for non-i.i.d. but ergodic stationary data generating processes provided that there is a central limit theorem to show that √n (1/n) Σ_{i=1}^n g(wi, θ0) →ᵈ N(0, V). See Hayashi (2000): Econometrics, Propositions 7.7 and 7.10. For instance, if {g(wi, θ0)} is a martingale difference sequence (Hayashi, top of p. 104), there is such a result (Hayashi, bottom of p. 106). See Sections 6.5-6.6 in Hayashi for data generating processes with serial correlation, for which there is also a CLT.
³ If W depends on the data, as it usually does, it must be the case that W →ᵖ W0, where W0 is positive definite (and W should be replaced with W0 in 4.2). For standard choices of W, this is the case.

and

Ĝ = (1/n) Σ_{i=1}^n ∂g(wi, θ̂)/∂θ′,   V̂ = (1/n) Σ_{i=1}^n g(wi, θ̂)g(wi, θ̂)′.

• The standard errors s.e.(θ̂) are the square roots of the diagonal of Âvar(θ̂).

• We can form 1 − α confidence intervals for each element θk of θ as

[θ̂k − s.e.(θ̂k)zα/2 , θ̂k + s.e.(θ̂k)zα/2],

where zα/2 is defined by Prob[Z > zα/2] = α/2 for Z ∼ N(0, 1).

• Most commonly, letting α = 0.05, zα/2 = z0.025 = 1.96. Then the confidence interval does not contain zero if |θ̂k|/s.e.(θ̂k) > 1.96, i.e. provided that the magnitude of the parameter estimate is about twice as large as the standard error.
• The optimal weighting matrix (the one that minimizes the variance of the estimator) is W = V̂⁻¹, where V̂ is a consistent estimator of the covariance matrix V.

• Let θ̃ be a consistent estimator of θ and let V̂ = (1/n) Σ_{i=1}^n g(wi, θ̃)g(wi, θ̃)′.

• Then using W = V̂⁻¹ as weighting matrix means that B̂ = Â, and we get a simpler expression for the asymptotic variance:

Âvar(θ̂) = Â⁻¹/n.

• As long as W is a consistent estimator of V⁻¹, we can also recalculate V̂ at θ̂ to get

Âvar(θ̂) = { Ĝ′ [(1/n) Σ_{i=1}^n g(wi, θ̂)g(wi, θ̂)′]⁻¹ Ĝ }⁻¹ / n.

• It can be useful in practice to note that all the divisions by n cancel in this expression, so we get:

{ [Σ_{i=1}^n ∂g(wi, θ̂)/∂θ′]′ [Σ_{i=1}^n g(wi, θ̂)g(wi, θ̂)′]⁻¹ [Σ_{i=1}^n ∂g(wi, θ̂)/∂θ′] }⁻¹.

4.3 OLS as a GMM estimator

• Let wi = {xi, yi} be i.i.d., where xi = (xi1, . . . , xiK)′ is K × 1 (and xi1 = 1), θ is K × 1, yi is a scalar, and

g(wi, θ) = xi(yi − xi′θ),

so L = K.

• The identification requirement 3. is now

E[xi(yi − xi′θ0)] = 0    (K × 1)

or

E[yi − xi′θ0] = 0 and E[xik(yi − xi′θ0)] = 0 for k = 2, . . . , K.    (4.2)

• Since ∂g(wi, θ0)/∂θ′ = −xixi′, the rank requirement 4. is that the second-moment matrix E[xixi′] of xi be of rank K.

• Is the parameter θ identified? Yes: we can express θ0 as a function of population moments:

E[xi(yi − xi′θ0)] = 0
E[xiyi] − E[xixi′]θ0 = 0
E[xixi′]θ0 = E[xiyi]
θ0 = E[xixi′]⁻¹E[xiyi],

where E[xixi′] is nonsingular since it is of full rank.

• It is helpful to arrange the data in vectors and matrices, as follows: X (n × K) with rows x1′, . . . , xn′, and y = (y1, . . . , yn)′ (n × 1).    (4.3)

• Then the GMM objective function in (4.1) is

[(y − Xθ)′X] W [X′(y − Xθ)],    (4.4)

where W is a positive definite K × K matrix, which means that the value of (4.4) is positive unless X′(y − Xθ) = 0, in which case its value is zero.

• Then the minimization problem in (4.1) is solved if there is a θ̂ that solves the K equations

X′X θ̂ = X′y.    (4.5)
(K×K) (K×1)

• Loosely speaking, since X′X/n = (1/n) Σ_{i=1}^n xixi′ converges to E[xixi′] as n gets large, X′X has rank K with probability one for large samples.

• In any case, as long as X′X has full rank in the sample, it is nonsingular and (4.5), called the normal equation, has a unique solution,

θ̂ = (X′X)⁻¹X′y.    (4.6)

• Since W does not enter (4.6), we do not really need to choose a weighting matrix W in
order to calculate the estimates from the sample (all positive definite K × K matrices
W will result in the same estimator).

• For calculations on a computer, it is better to find θ̂ as the solution to (4.5) rather than computing the inverse as in (4.6).

• For calculating standard errors, note that

Ĝ = (1/n) Σ_{i=1}^n ∂g(wi, θ̂)/∂θ′ = −(1/n) Σ_{i=1}^n xixi′ = −X′X/n

and

V̂ = (1/n) Σ_{i=1}^n [xi(yi − xi′θ̂)][xi(yi − xi′θ̂)]′ = (1/n) Σ_{i=1}^n (yi − xi′θ̂)² xixi′.

• Although the weighting matrix plays no role in deriving the OLS estimator, the choice W = (X′X/n)⁻¹ = −Ĝ⁻¹ results in the usual (heteroskedasticity-robust) standard errors.

• In this case, Ĝ′W = −IK (the identity matrix), so that

Âvar(θ̂) = [Ĝ′WĜ]⁻¹Ĝ′WV̂WĜ[Ĝ′WĜ]⁻¹/n
         = (X′X/n)⁻¹ V̂ (X′X/n)⁻¹/n
         = (X′X)⁻¹ [Σ_{i=1}^n (yi − xi′θ̂)² xixi′] (X′X)⁻¹.

• Under the common textbook assumption of homoskedasticity, we replace each squared residual (yi − xi′θ̂)² with the sample average (1/n) Σ_{i=1}^n (yi − xi′θ̂)², so that

Âvar(θ̂) = (X′X)⁻¹ [(1/n) Σ_{i=1}^n (yi − xi′θ̂)²] [Σ_{i=1}^n xixi′] (X′X)⁻¹
         = (X′X)⁻¹ [(1/n) Σ_{i=1}^n (yi − xi′θ̂)²] X′X (X′X)⁻¹
         = [(1/n) Σ_{i=1}^n (yi − xi′θ̂)²] (X′X)⁻¹

(where the n is often replaced with n − K to correct for finite-sample bias).
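• The robust and homoskedastic formulas can be compared directly in code. In the simulated example below (all numbers invented) the error variance depends on the regressor, so the robust standard error for the slope comes out noticeably larger:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([0.5, 1.5])
# Heteroskedastic errors: the error standard deviation grows with |x_i2|
u_true = rng.normal(size=n) * (1.0 + 0.5 * np.abs(X[:, 1]))
y = X @ theta0 + u_true

# Solve the normal equation (4.5) rather than forming (X'X)^{-1} explicitly
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ theta_hat

# Heteroskedasticity-robust (sandwich) estimated asymptotic variance
XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * (u ** 2)[:, None]).T @ X        # sum_i u_i^2 x_i x_i'
Avar_robust = XtX_inv @ meat @ XtX_inv

# Homoskedastic version: replace each u_i^2 by the average (1/n) sum u_i^2
Avar_homo = (u @ u / n) * XtX_inv

se_robust = np.sqrt(np.diag(Avar_robust))
se_homo = np.sqrt(np.diag(Avar_homo))
print(theta_hat, se_robust, se_homo)
```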

4.4 Two-stage least squares as a GMM estimator

• Let wi = {zi, xi, yi} be i.i.d., where zi = (zi1, . . . , ziL)′ is L × 1, xi = (xi1, . . . , xiK)′ is K × 1 (where xi1 = 1), θ is K × 1, L ≥ K, yi is a scalar, and

g(wi, θ) = zi(yi − xi′θ).

• The identification requirement 3. is that

E[zi(yi − xi′θ0)] = 0    (L × 1)

or

E[zil(yi − xi′θ0)] = 0 for l = 1, . . . , L.    (4.7)

• Since ∂g(wi, θ0)/∂θ′ = −zixi′, the rank requirement 4. is that the L × K matrix E[zixi′] be of rank K.
• As in the OLS case, the identification condition is a system of linear equations with θ0 as the vector of unknowns:

E[zixi′]θ0 = E[ziyi].    (4.8)

• If L = K, the K × K matrix E[zixi′] is nonsingular, since it has rank K, so (4.8) has the unique solution E[zixi′]⁻¹E[ziyi].

• If L > K, as long as the rank of E[zixi′] is K, the system is equivalent to one with K independent linear equations, so there cannot be multiple solutions to (4.8).
• The sample analogue of (4.8) is

(Z′X/n)θ = Z′y/n.    (4.9)

• Because of sampling variation, it is likely that these L equations are all at least slightly different, in the sense that the L × (K + 1) augmented matrix [Z′X | Z′y] of the system will have full rank.

• Then if L > K, there is no K × 1 vector θ that exactly satisfies all the L equations.

• We can still try to approximately satisfy each equation, but we need to make a choice
as to how to trade off errors in different equations.

• This choice is determined by the weighting matrix, which for the 2SLS estimator is W = (Z′Z/n)⁻¹, where Z (n × L) stacks the zi′ vectors vertically, in the same way as X: Z has rows z1′, . . . , zn′.    (4.10)

• The GMM objective function in (4.1) is

[(y − Xθ)′Z] W [Z′(y − Xθ)],    (4.11)

whose minimum, zero, is in general only achieved if L = K.

• This special case of 2SLS is called the instrumental variables (IV) estimator, and is θ̂ = (Z′X)⁻¹Z′y.

• Otherwise, we proceed by solving the first-order conditions for minimization for the estimator θ̂:

0 = (∂/∂θ) [(y − Xθ̂)′Z] W [Z′(y − Xθ̂)]
  = −2X′ZW[Z′(y − Xθ̂)]

X′ZWZ′X θ̂ = X′ZWZ′y    (4.12)
θ̂ = (X′ZWZ′X)⁻¹X′ZWZ′y.    (4.13)

• Again, we use (4.12), not (4.13) for the actual calculation.

• For the standard errors,

Ĝ = −(1/n) Σ_{i=1}^n zixi′ = −Z′X/n

and

V̂ = (1/n) Σ_{i=1}^n [zi(yi − xi′θ̂)][zi(yi − xi′θ̂)]′ = (1/n) Σ_{i=1}^n (yi − xi′θ̂)² zizi′.

• The weighting matrix W = (Z′Z/n)⁻¹ results in the usual (heteroskedasticity-robust) standard errors:

Âvar(θ̂) = [Ĝ′WĜ]⁻¹Ĝ′WV̂WĜ[Ĝ′WĜ]⁻¹/n
         = [X′Z(Z′Z)⁻¹Z′X]⁻¹ X′Z(Z′Z)⁻¹ [Σ_{i=1}^n (yi − xi′θ̂)² zizi′] (Z′Z)⁻¹Z′X [X′Z(Z′Z)⁻¹Z′X]⁻¹.

• It is easy to see from these results that OLS is a special case of 2SLS where zi = xi .
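• As a check on (4.12)-(4.13), the sketch below computes the 2SLS estimator with W = (Z′Z/n)⁻¹ and verifies that it coincides with the two-stage construction the name refers to: project X on Z, then regress y on the fitted values. The data-generating process, with an endogenous regressor and L = 3 > K = 2, is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3000
z = rng.normal(size=(n, 2))
Z = np.column_stack([np.ones(n), z])              # L = 3 instruments
w = rng.normal(size=n)                            # unobservable, correlated with x
x = 0.5 * z[:, 0] + 0.5 * z[:, 1] + w + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])              # K = 2 regressors
y = X @ np.array([1.0, -1.0]) + w                 # true slope is -1

# 2SLS as GMM with W = (Z'Z/n)^{-1}, solving (4.12) directly
W = np.linalg.inv(Z.T @ Z / n)
theta_2sls = np.linalg.solve(X.T @ Z @ W @ Z.T @ X, X.T @ Z @ W @ Z.T @ y)

# Equivalent "two stages": project X on Z, then OLS of y on the fitted values
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
theta_two_stage = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

print(theta_2sls, theta_two_stage)  # identical up to floating-point error
```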

4.5 Panel data

• Let wi = {zi, xi, yi} be i.i.d., where zi = (zi1, . . . , ziT) is L × T and zit is L × 1, xi = (xi1, . . . , xiT) is K × T and xit is K × 1, with L ≥ K, and yi = (yi1, . . . , yiT)′ is T × 1, where yit is a scalar. The moments are

g(wi, θ) = Σ_{t=1}^T zit(yit − xit′θ) = zi(yi − xi′θ).

• The identification requirement 3. is that

E[Σ_{t=1}^T zit(yit − xit′θ0)] = E[zi(yi − xi′θ0)] = 0    (L × 1).    (4.14)

• Since E[Σ_{t=1}^T zit(yit − xit′θ)] = Σ_{t=1}^T E[zit(yit − xit′θ)], a more easily interpretable assumption, which implies (4.14), is

E[zit(yit − xit′θ)] = 0 for t = 1, . . . , T.

• The rank requirement 4. is that the L × K matrix E[Σ_{t=1}^T zitxit′] = E[zixi′] be of rank K.
• We create the following data matrices for a sample of size n. They look like the data matrices in (4.3) and (4.10), but for each observation i there are now T rows instead of only one: Z (nT × L) stacks the blocks z1′, . . . , zn′; X (nT × K) stacks x1′, . . . , xn′; and y (nT × 1) stacks y1, . . . , yn.    (4.15)

• The GMM objective function in (4.1) is then

[Σ_{i=1}^n g(wi, θ)]′ W [Σ_{i=1}^n g(wi, θ)]
= [Σ_{i=1}^n Σ_{t=1}^T zit(yit − xit′θ)]′ W [Σ_{i=1}^n Σ_{t=1}^T zit(yit − xit′θ)]    (4.16)
= [(y − Xθ)′Z] W [Z′(y − Xθ)]    (4.17)

– the same expression as in (4.11), although the data matrices are now the panel data versions defined in (4.15). The estimator is derived in the same way as (4.13), and therefore yields the same expression:

θ̂ = (X′ZWZ′X)⁻¹X′ZWZ′y.

• For the standard errors,

$$\underset{(L \times K)}{\hat{G}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T} \underset{(L \times 1)}{z_{it}}\,\underset{(1 \times K)}{x_{it}'} = -\underset{(L \times nT)}{Z'}\,\underset{(nT \times K)}{X}/n$$

and

$$\hat{V} = \frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{t=1}^{T} z_{it}(y_{it} - x_{it}'\hat\theta)\Big]\Big[\sum_{t=1}^{T} z_{it}(y_{it} - x_{it}'\hat\theta)\Big]', \tag{4.18}$$

which can also be written as $\frac{1}{n}\sum_{i=1}^{n}[z_i(y_i - x_i'\hat\theta)][z_i(y_i - x_i'\hat\theta)]'$.

• Inspection of (4.16) reveals that the GMM estimator defined here is exactly the same
as if we ignored the panel structure of the data and simply formed a 2SLS estimator,
but where the sample size is nT instead of n.
• The panel structure of the data only shows up in the expression for V̂ in (4.18): the sample average is here over i only, while the covariance matrix for each observation i is between the objects $\sum_{t=1}^{T} z_{it}(y_{it} - x_{it}'\theta)$ that are sums over T.

• Nothing in this section, other than the simplification of the notation, depends on the
assumption that T is the same for each i; it would be possible to let the number of
time periods depend on i (this is called an unbalanced panel).

• No particular use has been made of the fact that panel data normally means that t
represents time periods.

• The estimator as developed so far would work equally well for a clustered sample, for instance where i represents school classes and t individual students in a class.

• The standard errors that will result from the V̂ given in (4.18) correspond to the
clustered standard errors recommended for this case (see Wooldridge (2010), (20.25),
p. 865 for the OLS case).

• The underlying principle in both the panel and clustered sample cases is that while
we are happy to assume that observations are i.i.d. across i, we believe there may be
some form of dependence between t for a given i.

• Both the consistency of the estimator and the variance calculation in (4.18) depend only on the i-observations being i.i.d., and allow for any form of dependence in the t dimension.

• We could use W = (Z′Z/n)−1 to get a panel 2SLS estimator, in which case the standard errors would be as for 2SLS (but with the data matrices defined in (4.15)).

• If instead we choose the optimal weighting matrix W = V̂ −1 (where V̂ would typically be calculated with first-stage estimates θ̃ obtained with the 2SLS weighting matrix), (4.2) gives

$$\widehat{\mathrm{Avar}}(\hat\theta) = \bigg\{ X'Z \Big[\sum_{i=1}^{n}\Big(\sum_{t=1}^{T} z_{it}(y_{it} - x_{it}'\hat\theta)\Big)\Big(\sum_{t=1}^{T} z_{it}(y_{it} - x_{it}'\hat\theta)\Big)'\Big]^{-1} Z'X \bigg\}^{-1}.$$
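• The clustered variance calculation above can be sketched numerically as follows. The data-generating process (with a common within-i shock) is an illustrative assumption, and zit = xit so the model is just-identified:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, K = 400, 5, 2                  # n clusters, T observations each (L = K)

# Illustrative DGP: exogenous regressors plus a cluster-level error component,
# which induces dependence across t within each i
X = rng.normal(size=(n, T, K))
c = rng.normal(size=(n, 1))                    # common shock within cluster i
e = c + rng.normal(size=(n, T))
theta0 = np.array([1.0, -0.5])
y = X @ theta0 + e

Z = X                                          # just-identified: z_it = x_it

# Pooled estimator on the stacked (nT x K) data: same formula as 2SLS
Xs, Zs, ys = X.reshape(-1, K), Z.reshape(-1, K), y.reshape(-1)
theta_hat = np.linalg.solve(Zs.T @ Xs, Zs.T @ ys)

# Clustered V_hat: average over i of outer products of g_i = sum_t z_it e_it
res = y - X @ theta_hat                        # (n, T) residuals
g = np.einsum('itk,it->ik', Z, res)            # (n, K): per-cluster moment sums
V_hat = g.T @ g / n

# Sandwich with the optimal weighting matrix: Avar = (G' V^{-1} G)^{-1} / n
G = -Zs.T @ Xs / n
avar = np.linalg.inv(G.T @ np.linalg.inv(V_hat) @ G) / n
se_clustered = np.sqrt(np.diag(avar))
```

Note that the point estimate ignores the panel structure entirely (it is the pooled formula on nT rows); only V̂, summed within clusters before averaging, reflects it.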

4.6 Multiple equations

• Let wi = {{zi1 , . . . , ziM }, {xi1 , . . . , xiM }, (yi1 , . . . , yiM )′ }, where zim is Lm × 1, xim is Km × 1, with Lm ≥ Km , and yim is a scalar.

• For a given i, different m are different variables rather than the same variables at
different times.

• For instance, with M = 2, yi1 and yi2 could be quantity supplied and demanded of
product i, respectively.

• Each m has its own set of moment conditions, so that the vector of moments is:

$$g(w_i, \theta) = \begin{pmatrix} z_{i1}(y_{i1} - x_{i1}'\theta_1) \\ \vdots \\ z_{iM}(y_{iM} - x_{iM}'\theta_M) \end{pmatrix}.$$

Define

$$\underset{(\sum_m L_m \times M)}{z_i} = \begin{pmatrix} z_{i1} & & & \\ & z_{i2} & & \\ & & \ddots & \\ & & & z_{iM} \end{pmatrix}, \quad \underset{(M \times 1)}{y_i} = \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iM} \end{pmatrix}, \tag{4.19}$$

where any blank entry of zi is zero.
• We then stack the data in the same way as for the panel data case:

$$\underset{(nM \times \sum_m L_m)}{Z} = \begin{pmatrix} z_1' \\ \vdots \\ z_n' \end{pmatrix}, \quad \underset{(nM \times 1)}{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}. \tag{4.20}$$

• Depending on the relationship between the different θm we may wish to organize the xi in different ways.

• If θm = θ for all m, it follows that Km = K for all m. Then define the K × M matrix xi = (xi1 , xi2 , . . . , xiM ).

• If each m has its own θm , with no overlap, define

$$\underset{(\sum_m K_m \times M)}{x_i} = \begin{pmatrix} x_{i1} & & & \\ & x_{i2} & & \\ & & \ddots & \\ & & & x_{iM} \end{pmatrix}, \tag{4.21}$$

where any blank entry of xi is zero.

• In intermediate cases where different m have some elements of the coefficient vector θ in common, but not all, do something like (4.21), but with those elements of different xim that share a coefficient moved so that they are placed in the same row in xi .
• In all three cases we stack the observations i vertically to get

$$\underset{(nM \times K)}{X} = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix}, \tag{4.22}$$

where K is the length of the θ vector.

• We can then write the moments in the familiar form

$$g(w_i, \theta) = z_i(y_i - x_i'\theta)$$

and the GMM objective function is

$$[(y - X\theta)'Z]\,W\,[Z'(y - X\theta)],$$

which again gives the expression

$$\hat\theta = (X'ZWZ'X)^{-1}X'ZWZ'y$$

for the estimator.

• The covariance matrix of the moments is also as before (given the new definitions of the data matrices)

$$\hat{V} = \frac{1}{n}\sum_{i=1}^{n}\big[z_i(y_i - x_i'\hat\theta)\big]\big[z_i(y_i - x_i'\hat\theta)\big]', \tag{4.23}$$

and finally

$$\widehat{\mathrm{Avar}}(\hat\theta) = \bigg\{ X'Z \Big[\sum_{i=1}^{n}\big[z_i(y_i - x_i'\hat\theta)\big]\big[z_i(y_i - x_i'\hat\theta)\big]'\Big]^{-1} Z'X \bigg\}^{-1} \tag{4.24}$$

if we use an efficient weighting matrix W.⁴

• The expression for V̂ in (4.23) shows that we allow for an unrestricted correlation pattern across the moments of different equations (unrestricted in the sense that it is determined by the data).

• By contrast, estimating each of the M equations separately would be equivalent to imposing that V̂ be block diagonal (both for the weighting matrix and the standard errors) — i.e. assuming that the moments of different equations are uncorrelated.
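• The block-diagonal stacking in (4.19)–(4.21) can be made concrete in a few lines. The two-equation system, parameter values and error correlation below are illustrative assumptions; zim = xim, so each equation is just-identified:

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 1000, 2
L1, L2 = 2, 2                        # instruments per equation
K1, K2 = 2, 2                        # separate theta_m, no overlap

# Illustrative two-equation system with exogenous regressors (z_im = x_im)
x1 = np.column_stack([np.ones(n), rng.normal(size=n)])
x2 = np.column_stack([np.ones(n), rng.normal(size=n)])
theta1, theta2 = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
# Errors correlated across the two equations
e = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=n)
y1 = x1 @ theta1 + e[:, 0]
y2 = x2 @ theta2 + e[:, 1]

# Stack: each i contributes M = 2 rows; z_i' and x_i' are block diagonal
Z = np.zeros((n * M, L1 + L2))
X = np.zeros((n * M, K1 + K2))
Z[0::M, :L1], Z[1::M, L1:] = x1, x2
X[0::M, :K1], X[1::M, K1:] = x1, x2
y = np.empty(n * M)
y[0::M], y[1::M] = y1, y2

W = np.linalg.inv(Z.T @ Z / n)
A = X.T @ Z @ W
theta_hat = np.linalg.solve(A @ Z.T @ X, A @ Z.T @ y)   # (theta1', theta2')'

# V_hat is unrestricted: the moments of the two equations are allowed to covary
res = y - X @ theta_hat
g = (Z * res[:, None]).reshape(n, M, -1).sum(axis=1)    # (n, L1+L2) per-i moments
V_hat = g.T @ g / n
```

The off-diagonal blocks of V̂ pick up the cross-equation error correlation (0.7 here); estimating the equations separately would force those blocks to zero.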

4.7 Some alternatives for the weighting matrix

• This subsection mentions some alternative choices of weighting matrix, which may
work better in some cases.
⁴ I.e. a consistent estimator of V −1 , typically obtaining θ̃ with (Z′Z/n)−1 as the first-stage weighting matrix (which gives the system 2SLS estimator).

• Instead of the moment covariance matrix V̂ used above, one can use the centred moments to form the covariance matrix:

$$\hat{V} = \frac{1}{n}\sum_{i=1}^{n}\big[g(w_i, \hat\theta) - \bar{g}\big]\big[g(w_i, \hat\theta) - \bar{g}\big]',$$

where $\bar{g} = \frac{1}{n}\sum_{i=1}^{n} g(w_i, \hat\theta)$. (See Bruce Hansen: Econometrics, version January 2018, p. 385.)

• Since ḡ — evaluated at a consistent estimator θ̂ — converges in probability to E[g(wi , θ0 )] = 0, the centred and uncentred versions are asymptotically equal, although they will differ in a given sample.

• One disadvantage of the uncentred weighting matrix is that a moment whose sample
value is far from zero at the preliminary estimates will receive a low weight with the
resulting (estimated) optimal weighting matrix, even if its variance is not very large.
Centering the moments is a remedy for this problem.

• The standard approach is to minimize the GMM objective, update the weighting matrix, minimize again, etc., possibly multiple times. Instead, it is possible to update the weighting matrix continuously, i.e. to define the estimator as:

$$\hat\theta = \arg\min_{\theta} \Big[\sum_{i=1}^{n} g(w_i, \theta)\Big]' \hat{V}(\theta)^{-1} \Big[\sum_{i=1}^{n} g(w_i, \theta)\Big], \tag{4.25}$$

where V̂ (θ) is the (centred or uncentred) covariance matrix of the moments evaluated at θ. (See Bruce Hansen: Econometrics, version January 2018, p. 392.)

• Very often the moment is the product of instruments and an additive prediction error,
i.e. the l-th entry in the L × 1 vector g(wi , θ) is

gl (wi , θ) = zli [yi − f (xi , θ)] = zli yi − zli f (xi , θ),

where zli is a (scalar) instrument.

• In this case, letting the weighting matrix be the diagonal matrix whose (l, l) entry is

$$\frac{1}{\big(\sum_{i=1}^{n} z_{li}\, y_i\big)^2}$$

results in the estimator

$$\hat\theta = \arg\min_{\theta} \sum_{l=1}^{L} \Bigg[\frac{\sum_{i=1}^{n} z_{li}\big[y_i - f(x_i, \theta)\big]}{\sum_{i=1}^{n} z_{li}\, y_i}\Bigg]^2, \tag{4.26}$$

i.e. that minimizes the sum of the squared percentage deviations of the moments.

• The advantage of (4.26) is that it is intuitive, in the sense that the contribution of
each moment to the total value of the objective function is transparent. This may
for instance reveal that one moment is very large, and suggest ways of improving the
specification of the model so as to fit this moment better.

• This estimator can be used either as a first stage, to obtain estimates for the optimal
weighting matrix, or for the final estimates. (See Low and Meghir (2017): The Use of
Structural Models in Econometrics, Journal of Economic Perspectives, p. 52-53.)
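• A minimal sketch of the percentage-deviation objective (4.26), using an assumed linear f(xi , θ) = θ1 + θ2 xi and instruments zi = (1, xi )′ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Illustrative moment set-up (all names here are assumptions for the sketch)
x = rng.uniform(1.0, 2.0, size=n)
theta0 = np.array([1.0, 0.5])
y = theta0[0] + theta0[1] * x + 0.1 * rng.normal(size=n)
Z = np.column_stack([np.ones(n), x])           # instruments z_li

def objective(theta):
    """Sum of squared percentage deviations of the moments, as in (4.26)."""
    resid = y - (theta[0] + theta[1] * x)
    num = Z.T @ resid                  # sum_i z_li [y_i - f(x_i, theta)]
    den = Z.T @ y                      # sum_i z_li y_i
    return float(np.sum((num / den) ** 2))

# Each moment's contribution is transparent: near zero at theta0, and
# clearly visible when a parameter is moved away from its true value
at_truth = objective(theta0)
perturbed = objective(theta0 + np.array([0.5, 0.0]))
```

Inspecting the L individual squared terms (rather than their sum) is what makes it easy to see which moment the model fits badly.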

4.8 Sketch of a proof for asymptotic normality

• You may find this section useful to get some sense of where the large-sample properties of the GMM estimator come from, but no further use will be made of this material, so it can be skipped.

• To simplify the notation, write $\bar{g}(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(w_i, \theta)$.

• By the Central Limit Theorem, $\sqrt{n}$ times the sample average of the moments has an asymptotic normal distribution⁵

$$\sqrt{n}\,\big(\bar{g}(\theta_0) - 0\big) \overset{d}{\longrightarrow} N(0, V), \tag{4.27}$$

where the zero on the left-hand side comes from assumption 3.

• The first-order conditions for (4.1) are

$$0 = \Big[\frac{\partial}{\partial\theta}\bar{g}(\hat\theta)\Big]' W \bar{g}(\hat\theta). \tag{4.28}$$

• A first-order Taylor approximation of ḡ(θ̂) around θ0 gives

$$\bar{g}(\hat\theta) \approx \bar{g}(\theta_0) + \Big[\frac{\partial}{\partial\theta}\bar{g}(\theta_0)\Big](\hat\theta - \theta_0). \tag{4.29}$$
⁵ For more detail than provided here, see Theorems 14.1 and 14.2 in Wooldridge (2010): Econometric Analysis of Cross Section and Panel Data, 2nd ed., and Theorems 2.6 and 3.4 in Newey and McFadden (1994): Large sample estimation and hypothesis testing, Ch. 36, vol. 4, Handbook of Econometrics.

• Substituting (4.29) into (4.28) gives

$$0 \approx \Big[\frac{\partial}{\partial\theta}\bar{g}(\hat\theta)\Big]' W \bar{g}(\theta_0) + \Big[\frac{\partial}{\partial\theta}\bar{g}(\hat\theta)\Big]' W \Big[\frac{\partial}{\partial\theta}\bar{g}(\theta_0)\Big](\hat\theta - \theta_0)$$

$$\sqrt{n}\,(\hat\theta - \theta_0) \approx -\Big\{\Big[\frac{\partial}{\partial\theta}\bar{g}(\hat\theta)\Big]' W \Big[\frac{\partial}{\partial\theta}\bar{g}(\theta_0)\Big]\Big\}^{-1}\Big[\frac{\partial}{\partial\theta}\bar{g}(\hat\theta)\Big]' W \sqrt{n}\,\bar{g}(\theta_0)$$

$$\approx -(G'WG)^{-1}G'W\sqrt{n}\,\bar{g}(\theta_0),$$

where the last line holds for large n, since $\frac{\partial}{\partial\theta}\bar{g}(\hat\theta)$ and $\frac{\partial}{\partial\theta}\bar{g}(\theta_0)$ both converge in probability to G.

• Finally, by the delta method (which is the asymptotic version of the fact that z ∼ N (0, Σ) implies Cz ∼ N (0, CΣC ′ ), where C is a matrix and z a vector), (4.27) implies (4.2).

4.9 The role of the rank condition

• This section attempts to provide some intuition for Requirement 4. of the GMM
estimator.

• Consider 2SLS as a GMM estimator for K = L = 3, so xi = (1, xi2 , xi3 )′ and zi = (1, zi2 , zi3 )′ .

• Here $\frac{\partial}{\partial\theta}g(w_i, \theta) = -z_i x_i'$, so the requirement 4. is that the following 3 × 3 matrix has full rank:

$$E\Bigg[\begin{pmatrix} 1 \\ z_{i2} \\ z_{i3} \end{pmatrix}\begin{pmatrix} 1 & x_{i2} & x_{i3} \end{pmatrix}\Bigg] = \begin{pmatrix} 1 & E(x_{i2}) & E(x_{i3}) \\ E(z_{i2}) & E(z_{i2}x_{i2}) & E(z_{i2}x_{i3}) \\ E(z_{i3}) & E(z_{i3}x_{i2}) & E(z_{i3}x_{i3}) \end{pmatrix}.$$

• Let ∼ denote equivalence in the sense of rank-preserving elementary row operations.

• If we subtract E(zi2 ) times the first row from the second row, and E(zi3 ) times the first row from the third row, we get

$$\sim \begin{pmatrix} 1 & E(x_{i2}) & E(x_{i3}) \\ 0 & \mathrm{Cov}(z_{i2}, x_{i2}) & \mathrm{Cov}(z_{i2}, x_{i3}) \\ 0 & \mathrm{Cov}(z_{i3}, x_{i2}) & \mathrm{Cov}(z_{i3}, x_{i3}) \end{pmatrix}$$

• We see that the rank is 2 if any of the following is true:

– One of the variables xi2 or xi3 has zero covariance both with zi2 and zi3 .
– One of the instruments zi2 or zi3 has zero covariance both with xi2 and xi3 .

– More generally, there is a γ such that Cov(zil , xi3 ) = γCov(zil , xi2 ) for l = 2, 3 —
i.e. both instruments are related to xi2 and xi3 in the same way.

• For the general case with moments g(wi , θ0 ) = (g1 , g2 )′ where K = L = 2,

$$E\begin{pmatrix} \frac{\partial}{\partial\theta_1}g_1 & \frac{\partial}{\partial\theta_2}g_1 \\[4pt] \frac{\partial}{\partial\theta_1}g_2 & \frac{\partial}{\partial\theta_2}g_2 \end{pmatrix}$$

fails to have full rank if:



– One of the parameters θk has no effect on any of the moments, i.e. $\frac{\partial}{\partial\theta_k}g_l = 0$ for l = 1, 2.

– One of the moments gl does not react to any of the parameters, i.e. $\frac{\partial}{\partial\theta_k}g_l = 0$ for k = 1, 2.

– More generally, there is a γ such that $\frac{\partial}{\partial\theta_2}g_l = \gamma\,\frac{\partial}{\partial\theta_1}g_l$ for l = 1, 2 — i.e. the two parameters affect both moments in the same way.

• However, either of the following patterns is fine

$$E\begin{pmatrix} \frac{\partial}{\partial\theta_1}g_1 & 0 \\[4pt] 0 & \frac{\partial}{\partial\theta_2}g_2 \end{pmatrix}, \qquad E\begin{pmatrix} 0 & \frac{\partial}{\partial\theta_2}g_1 \\[4pt] \frac{\partial}{\partial\theta_1}g_2 & 0 \end{pmatrix},$$

assuming the entries not written as zeros do not equal zero.

• Loosely speaking, we can sum up the rank condition as requiring there to be at least
as many independent sources of variation — i.e. relationships between a moment and
a parameter — as there are parameters to estimate.
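• The two failure patterns for the 2SLS example can be checked numerically. The matrices below are illustrative population values of E[zi xi′ ], assuming E(xi2 ) = E(xi3 ) = E(zi2 ) = E(zi3 ) = 0 for simplicity:

```python
import numpy as np

# Good case: each instrument has its own independent relation to a regressor
M_good = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],   # Cov(z2, x2) = 1, Cov(z2, x3) = 0
                   [0.0, 0.0, 1.0]])  # Cov(z3, x2) = 0, Cov(z3, x3) = 1

# Failure case: Cov(z_l, x3) = 2 Cov(z_l, x2) for l = 2, 3 -- both
# instruments relate to x2 and x3 in the same way, so the last two
# columns are proportional and the rank drops to 2
M_bad = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 2.0],
                  [0.0, 0.5, 1.0]])

rank_good = np.linalg.matrix_rank(M_good)
rank_bad = np.linalg.matrix_rank(M_bad)
```

In the failure case the data cannot distinguish movements along the direction (γ, −1) in (θ2 , θ3 )-space, which is exactly the loss of an independent source of variation described above.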

5 Discrete-choice demand
• Consider again the utility model (1.1),

$$u_{ij} = x_j\beta - \alpha p_j + e_{ij}. \tag{5.1}$$

• If we observed a vector zi of characteristics of the consumer (such as income, household size, rural/urban etc.), we could make eij a function of these variables, so that

$$u_{ij} = x_j\beta - \alpha p_j + g(z_i, x_j, p_j, \gamma) + \tilde{e}_{ij}, \tag{5.2}$$

where the function g allows for interactions between zi and the product attributes (xj , pj ), γ are parameters to be estimated, and ẽij is the remaining error term.

• If instead we have no information about individual consumers, we can let eij = ξj + εij
where ξj is a component common to all consumers and εij is a random term, i.i.d.
across i, and often but not necessarily i.i.d. across j, so

uij = xj β − αpj + ξj + εij . (5.3)

• To use (5.3) for estimation, we need to transform it into something that corresponds
to our dependent variable, which is the market share, sj .

• Consumer i chooses alternative j if uij ≥ uik for all k = 1, . . . , J.

• Letting θ = (α, β) and ξ = (ξ1 , . . . , ξJ ), the probability that i chooses j is then

$$P(j|i, \theta, \xi) = E\big[1(j = \arg\max_k u_{ik})\big] = \int 1\big(j = \arg\max_k (x_k\beta - \alpha p_k + \xi_k + \varepsilon_{ik})\big) f(\varepsilon_i)\,d\varepsilon_i, \tag{5.4}$$

where f is the joint density function of the random variables εi = (εi1 , . . . , εiJ )′ .

• Since the only consumer-specific term in utility is εi , and f is assumed to be the same
for all i, the probability is the same for all consumers, so we have

P (j|θ, ξ) = P (j|i, θ, ξ), (5.5)

where the right-hand side is given by (5.4).

• By contrast, if we observe demographics zi , assume that ẽij = ξj + εij in (5.2), and let θ = (α, β, γ), we get

$$P(j|i, \theta, \xi) = \int 1\big(j = \arg\max_k (x_k\beta - \alpha p_k + g(z_i, x_k, p_k, \gamma) + \xi_k + \varepsilon_{ik})\big) f(\varepsilon_i)\,d\varepsilon_i, \tag{5.6}$$

in which case the simplification (5.5) no longer holds.

• Broadly speaking, estimation will take the form of choosing the value of the parame-
ters θ and the error term ξ so that the model’s predictions match the corresponding
observations.

• More concretely, if we have individual-level information zi and 0/1 purchasing decisions


dij = 1[consumer i is observed to choose j], estimation will attempt to choose θ and ξ
to make P (j|i, θ, ξ) close to dij in some sense.

• If we have only product-level data in the form of market shares sj for each product,
we will choose θ and ξ to make P (j|θ, ξ) close to sj .

5.1 Utility maximization subject to a budget constraint

• We can frame this problem as a standard case of maximizing utility subject to a budget
constraint.

• Let qj be the quantity purchased of alternative j.

• Let z be the quantity consumed of all other goods. We normalize the price of z to one.

• Let yi be the budget of consumer i.

• To reflect the choice situation we are modelling,

– qj can only take on the values 0 and 1 (either one or no units)


– the total number of units (across j) is one (exactly one alternative chosen)

• The utility maximization problem is then:

$$\max_{q_1, \dots, q_J, z} U(q_1, \dots, q_J, z) \quad \text{s.t.} \quad \begin{cases} \sum_{j=1}^{J} p_j q_j + z = y_i \\ q_j \in \{0, 1\} \\ \sum_{j=1}^{J} q_j = 1. \end{cases}$$

• Suppose the utility function is

$$U(q_1, q_2, \dots, q_J, z) = \sum_{j=1}^{J}(x_j\beta + e_{ij})q_j + \alpha z.$$

• Substituting in the budget constraint $z = y_i - \sum_{j=1}^{J} p_j q_j$, we get, for each j, the conditional (on j) indirect utility functions:

$$U(1, 0, \dots, 0, y_i - p_1) = x_1\beta - \alpha p_1 + e_{i1} + \alpha y_i$$
$$U(0, 1, \dots, 0, y_i - p_2) = x_2\beta - \alpha p_2 + e_{i2} + \alpha y_i$$
$$\vdots$$
$$U(0, 0, \dots, 1, y_i - p_J) = x_J\beta - \alpha p_J + e_{iJ} + \alpha y_i.$$

• The final step in the utility maximization problem is then to choose the j that has the
highest conditional indirect utility.

• Since the term αyi does not affect this choice, we drop it for each j by defining

$$u_{i1} = U(1, 0, \dots, 0, y_i - p_1) - \alpha y_i$$

and similarly for j = 2, . . . , J.

• This results in the conditional indirect utility function (5.1).

• Other utility specifications are also possible.

• For instance, if

$$U(q_1, q_2, \dots, q_J, z) = z^{\alpha}\prod_{j=1}^{J}\exp[(x_j\beta + e_{ij})q_j],$$

substitution of the budget constraint and taking logs result in:

$$\ln U(1, 0, \dots, 0, y_i - p_1) = \alpha \ln(y_i - p_1) + x_1\beta + e_{i1},$$

etc.

• Note that in this case the income term yi no longer cancels from the comparison across
j.

5.2 The outside good and normalizations

• We designate one of the alternatives as the ‘outside good’ denoted j = 0, so that there
is a total of J + 1 alternatives when we count the outside good.

• Usually the outside good is the alternative of not buying any of the products in the
market, but it could also be ’the other products’ aggregated into one alternative if we
only model alternatives above a minimum market share.

• Whether we have individual-level data or not, we always assume that the term ξj has
the value zero for j = 0:

ξ0 = 0.

• It is also common to assume that the remaining parts of uij , other than εij , are also
zero for j = 0, so that ui0 = εi0 .

• Both of these assumptions are normalizations. That is, they do not limit the pattern of choices that the model can predict, but rather ensure that different sets of parameters and ξ’s correspond to different choice predictions.

• To see this more clearly, suppose we did not impose ξ0 = 0. Then we could add any
constant c to ξ0 as well as to all the other ξj , and predicted choices would be unchanged.

• We could also multiply every utility uij by some c > 0 without changing the model’s
predictions.

• This implies that we can impose another normalization, which is usually achieved by
fixing the variance of εi0 .

• It is also common to impose that all j have the same variance, but this is an actual
restriction, not a normalization.

5.3 The role of εij and common choices for its distribution

• Suppose for a moment that there is no εij -term in our model (so εij = 0 with probability
1 for all i and j).

• Then, with product-level data only, our model’s prediction for whether the consumer chooses j is simply 1(j = arg maxk (xk β − αpk + ξk )).

• Since this indicator function takes on the same value for all i, and equals 1 for the
‘best’ j and 0 for all other j, the model would predict that all consumers choose the
product with the highest value for xj β − αpj + ξj .

• The heterogeneity represented by this random shock is therefore necessary to get pos-
itive choice probabilities for all alternatives j.

• With individual-level data, εij allows for the possibility that there are still consumer-
specific unobservables that affect choices. Otherwise our model would imply that all
consumers with the same zi choose the same alternative.

• The most common choice of distribution for εi = (εi0 , . . . , εiJ )′ is that they are i.i.d. type-1 extreme value, in which case the integral in (5.4) has the following analytical solution:⁶

$$P(j|\theta, \xi) = \int 1\big(j = \arg\max_k (x_k\beta - \alpha p_k + \xi_k + \varepsilon_{ik})\big) f(\varepsilon_i)\,d\varepsilon_i = \frac{\exp(x_j\beta - \alpha p_j + \xi_j)}{1 + \sum_{j'=1}^{J}\exp(x_{j'}\beta - \alpha p_{j'} + \xi_{j'})} \tag{5.7}$$

$$P(0|\theta, \xi) = \frac{1}{1 + \sum_{j'=1}^{J}\exp(x_{j'}\beta - \alpha p_{j'} + \xi_{j'})}. \tag{5.8}$$

• When εi is type-1 extreme value, we have what is called a logit model.
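• The closed forms (5.7)–(5.8) are a one-liner to evaluate. In this Python sketch the product data and parameter values are illustrative assumptions, not estimates:

```python
import numpy as np

rng = np.random.default_rng(5)
J = 4

# Illustrative product data and parameter values
x = rng.normal(size=J)               # one characteristic per inside product
p = rng.uniform(1.0, 3.0, size=J)    # prices
xi = rng.normal(scale=0.5, size=J)   # unobserved quality terms xi_j
beta, alpha = 1.0, 0.8

def logit_probs(beta, alpha, xi):
    """Closed-form logit probabilities (5.7)-(5.8); index 0 = outside good."""
    v = np.exp(x * beta - alpha * p + xi)   # exp of mean utility, j = 1..J
    denom = 1.0 + v.sum()                   # outside good contributes exp(0) = 1
    return np.concatenate([[1.0 / denom], v / denom])

P = logit_probs(beta, alpha, xi)
```

By construction the J + 1 probabilities are strictly positive and sum to one, which is the role of the εij heterogeneity discussed above.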

• The other common choice of distribution is multivariate normal with mean zero and an estimated (symmetric) covariance matrix (but with one variance normalized to one):

$$\varepsilon_i = \begin{pmatrix} \varepsilon_{i0} \\ \varepsilon_{i1} \\ \vdots \\ \varepsilon_{iJ} \end{pmatrix} \sim N(0, C) = N\Bigg(\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \sigma_{01} & \dots & \sigma_{0J} \\ \sigma_{01} & \sigma_{11} & \dots & \sigma_{1J} \\ \vdots & & \ddots & \vdots \\ \sigma_{0J} & \sigma_{1J} & \dots & \sigma_{JJ} \end{pmatrix}\Bigg)$$

• This choice gives us what is called a probit model.

• It is of course possible to restrict some or all of the σjk -parameters to be zero, or equal
to each other, to reduce the number of parameters one needs to estimate.

• Usually the full covariance matrix of a probit model is only estimated in settings where
J is very small, like 2 or 3.

• For the probit model there is no analytical solution to the integral in (5.4), so it must
be approximated by a technique called simulation.
⁶ See Train (2009): Discrete-choice methods with simulation, 2nd ed., for a proof of this statement, and also for more discussion of normalizations and the distribution of εi .

• Software such as Matlab has inbuilt functions that let us take “draws”, i.e. realizations, of random variables with a chosen distribution.
• Suppose we have taken R (independent) draws νi(r) from a (J + 1)-dimensional standard normal distribution.

• Then if AA′ is the Cholesky decomposition of the covariance matrix C, the distribution of the R (J + 1) × 1 vectors εi(r) = Aνi(r) is approximately N (0, C).

• Then for large R, the following approximation holds:

$$P(j|\theta, \xi) \approx \frac{1}{R}\sum_{r=1}^{R} 1\big(j = \arg\max_k (x_k\beta - \alpha p_k + \xi_k + \varepsilon_{ik}^{(r)})\big) \tag{5.9}$$
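• A minimal sketch of this frequency simulator in Python. The mean utilities and covariance matrix C are illustrative assumptions; note the (0,0) variance fixed at one by the normalization:

```python
import numpy as np

rng = np.random.default_rng(6)
J, R = 2, 200_000                    # small J, as is typical for probit

# Illustrative mean utilities (outside good first) and covariance matrix C
delta = np.array([0.0, 0.5, -0.2])
C = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.5, 0.4],
              [0.2, 0.4, 1.2]])
A = np.linalg.cholesky(C)            # C = A A'

nu = rng.standard_normal((R, J + 1))   # (J+1)-dimensional standard normal draws
eps = nu @ A.T                         # eps^(r) = A nu^(r), distributed N(0, C)

u = delta + eps                        # utility of each alternative, per draw
P_sim = np.bincount(u.argmax(axis=1), minlength=J + 1) / R   # (5.9)
```

The simulated probabilities converge to the probit integral at rate 1/√R; in practice smoother simulators are often preferred to this raw frequency count, but the logic is the same.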

5.4 The role of ξj in fitting the model to the data

• The ξj -term plays the double role of (i) ensuring that the model fits the data, and (ii)
allowing for unobserved demand shocks that are correlated with price. We will return
to (ii) later, and focus on (i) in this subsection.

• Consider the following random experiment. There are j = 1, . . . , J different alternatives, each chosen with probability ρj , so that $\sum_{j=1}^{J}\rho_j = 1$.

• Suppose we can take repeated draws, i = 1, . . . , n from this distribution, each of which
results in a choice of one j between 1 and J. (We assume that the probabilities ρj
remain unchanged as we take these draws).

• Let Xj be the random variable representing the number of times alternative j is chosen out of a total of n draws. The random variables Xj (for each j) are said to have a multinomial distribution.

• Random variables with a multinomial distribution have mean and variance E(Xj ) =
nρj and Var(Xj ) = nρj (1 − ρj ).

• Consider some differentiated products market, such as that for new cars. Assume there
are J different alternatives and that each buyer chooses exactly one of them.

• Suppose the probability that a randomly picked buyer of a new car chooses alternative
j is ρj = Prob(j).

• Then, if we take a random sample of size n of buyers of new cars, the number of buyers
of alternative j is given by the multinomial random variable Xj .

• Therefore the observed market share sj in a sample of size n is Xj /n and has mean and variance

$$E(s_j) = E(X_j/n) = E(X_j)/n = \rho_j,$$
$$\mathrm{Var}(s_j) = \mathrm{Var}(X_j/n) = \mathrm{Var}(X_j)/n^2 = \rho_j(1 - \rho_j)/n.$$

• It follows from this that the variance Var(sj ) becomes negligible for large n, so that the observed market share will in practice be equal to the probability ρj when n is on the order of several million, for instance.
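• A quick simulation makes the point about vanishing sampling error concrete; the choice probabilities below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
rho = np.array([0.5, 0.3, 0.2])      # illustrative true choice probabilities

# Simulate observed market shares s_j = X_j / n across 2000 replications
n = 1_000_000
draws = rng.multinomial(n, rho, size=2000)   # (2000, 3) counts X_j
s = draws / n

emp_sd = s.std(axis=0)                       # empirical sd of the market shares
theory_sd = np.sqrt(rho * (1 - rho) / n)     # sqrt(Var(s_j)) from the text
```

With n = 10⁶ the standard deviation of sj is on the order of 5 × 10⁻⁴, so treating the observed share as equal to ρj is a good approximation; rerunning with n = 100 makes the sampling error clearly visible.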

• With individual-level data, on the other hand, it is clear that the “market share” Xj
for a single individual, which is either 0 or 1, has a substantial variance, called sampling
error, and that it will therefore not equal ρj .

• When modelling a discrete-choice situation, for instance with a model like (5.4), we
are trying to model the probability ρj .

• Concretely, we assume that our model is correct, in the sense that there are ‘true
values’ of (θ, ξ), which we denote (θ0 , ξ0 ), such that

ρj = P (j|θ0 , ξ0 ), (5.10)

where P (j|θ, ξ) is given by the functional form (5.4).

• Then, when the observed market shares sj have been generated from a large enough
sample of consumers that Var(sj ) = 0, so sj = ρj , the assumption (5.10) that our
model is correct implies that

sj = P (j|θ0 , ξ0 ). (5.11)

• Next consider the case of individual-level data.

• We write the population conditional choice probability given demographic variables z as ρj (z) = Prob(j|z).

• Now dij = 1[consumer i is observed to choose j] is a multinomial random variable with n = 1, so

$$\mathrm{Var}(d_{ij}|z_i) = \rho_j(z_i)[1 - \rho_j(z_i)] > 0.$$

• The claim that our model (5.6) is correct now amounts to the statement

ρj (zi ) = P (j|i, θ0 , ξ0 ). (5.12)

• But since Var(dij |zi ) > 0, dij ≠ E(dij |zi ) = ρj (zi ), so that even if (5.12) holds (i.e. the model is correct), we get

$$d_{ij} \neq P(j|i, \theta_0, \xi_0). \tag{5.13}$$

• We have shown that when market shares come from a large number of consumers, the
absence of sampling error implies that at the true parameter values, the model must
correctly predict market shares.

• However, no matter what values are chosen for (α, β), a model without the ξj -term
would typically not have enough degrees of freedom (enough “moving parts”) to allow

sj = P (j|θ, ξ = 0)

to hold for all j.

• This is closely analogous to the fact that in a linear regression model yi = xi θ + ei , there is no value of θ that fits the data perfectly without the error term ei . That is, there is no value of θ that solves the i = 1, . . . , n equations

$$y_i = x_i\theta,$$

apart from in the extreme case where n is equal to the number of columns (and rank) of xi .

• Another way of saying that we cannot fit the observed market shares when ξ = 0 is
that

sj = P (j|θ, ξ = 0) + ωj (5.14)

for some non-zero error term ωj .

• The problem with this is that it is very hard to justify the presence of ωj , since it
cannot, as we have seen, be explained by sampling error.

• An alternative justification is that ωj represents a systematic tendency for consumers


to prefer (if ωj > 0) or dislike (if ωj < 0) alternative j.

• But this kind of justification for ωj would contradict the structural starting point of our model, i.e. the claim that xj β − αpj + εij represents the surplus (indirect utility) the consumer gets from choosing j.

• That is, if there was such a tendency for consumers to prefer or dislike j, it should
show up in uij , not added onto the choice probability.

• And this is precisely the function served by ξj ; it represents a tendency for consumers
to prefer or dislike j, and in line with this structural interpretation, it enters the utility
function (5.3).

• With product-level data, ξj is really just another error term, just like ei in the linear
regression model yi = xi θ + ei — that is, whatever needs to be added to the xi θ to get
yi .

• Although it is not usually emphasized in econometrics textbooks, the definition of the error term is simply

$$e_i = y_i - x_i\theta; \tag{5.15}$$

so ei is not really just another variable; it is simply whatever is needed to fill in the difference between xi θ and the dependent variable yi .

• In our structural discrete-choice demand model with product-level data, ξj is also an


error term in the same sense; it is whatever is needed to make sure that

sj = P (j|θ, ξ) (5.16)

• The only difference is that our structural interpretation of the uij and P (j|θ, ξ) dictates
that we can have an error term in the former, but not in the latter.

• In conclusion, then, ξ is a vector of error terms defined as the solutions to the system
of equations given by (5.16) for each j, just like the vector (e1 , . . . , en ) of error terms
in the linear regression model is the solution to (5.15) for each i.

• When we have individual-level data, the interpretation of ξj is different. Since there is no longer one ξj for each observation (as there effectively is with product-level data), ξj is now a product-specific constant.

• Similarly to (5.16), we can define the vector ξ as the solution to

$$s_j = \frac{1}{n}\sum_{i=1}^{n} P(j|i, \theta, \xi), \tag{5.17}$$

where $s_j = \frac{1}{n}\sum_{i=1}^{n} d_{ij}$.

5.5 Estimation

• In the case of product-level data, by defining ξ through (5.16), we exclude the possibility
of estimating θ with GMM using moment conditions of the form

zj [sj − P (j|θ, ξ)],

since the value would be zero for each j and any value of θ.

• The analogy from the linear regression model would be to use zi [yi − (xi θ + ei )] as a
moment. Again it would be zero for all i and all values of θ.

• By contrast, individual-level data allow us to either

1. estimate both θ and ξ as parameters (i.e. without distinguishing between the two
in any particular way) with GMM, using moments

zij [dij − P (j|i, θ, ξ)], (5.18)

or, alternatively,
2. let ξ(θ) be defined as an implicit function of θ by (5.17), and estimate θ only by
GMM, using moments

zij [dij − P (j|i, θ, ξ(θ))]. (5.19)

• An alternative to GMM is maximum likelihood estimation.

• The likelihood for consumer i, or equivalently, the probability of observing consumer i’s actual purchase, is

$$L_i(\theta, \xi) = \prod_{j=1}^{J} P(j|i, \theta, \xi)^{d_{ij}},$$

where dij is one for the particular j chosen by i and zero for all other j.

• The likelihood for all consumers i = 1, . . . , n is

$$L^{ind}(\theta, \xi) = \prod_{i=1}^{n} L_i(\theta, \xi) = \prod_{i=1}^{n}\prod_{j=1}^{J} P(j|i, \theta, \xi)^{d_{ij}}.$$

• In the case of product-level data, $P(j|i, \theta, \xi) = P(j|\theta, \xi)$ and $s_j = \frac{1}{n}\sum_{i=1}^{n} d_{ij}$, or $\sum_{i=1}^{n} d_{ij} = n s_j$, where n is the number of consumers that went into the calculation of the market share sj . We then get the simplification

$$L^{prod}(\theta, \xi) = \prod_{j=1}^{J} P(j|\theta, \xi)^{\sum_{i=1}^{n} d_{ij}} = \prod_{j=1}^{J} P(j|\theta, \xi)^{n s_j}.$$

• We can take the natural logarithm of the likelihood function to get, for individual-level and product-level data, respectively:

$$\log L^{ind}(\theta, \xi) = \sum_{i=1}^{n}\sum_{j=1}^{J} d_{ij} \log P(j|i, \theta, \xi)$$

$$\log L^{prod}(\theta, \xi) = n\sum_{j=1}^{J} s_j \log P(j|\theta, \xi),$$

where the scaling by n can be ignored, since it does not affect the value of θ that maximizes the function.
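• The equality of the individual- and product-level log-likelihoods when P (j|i, θ, ξ) = P (j|θ, ξ) is easy to verify numerically; the probabilities below (over three inside goods plus the outside good) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5000

# Illustrative choice probabilities, assumed constant across consumers
# (the product-level case), including the outside good
P = np.array([0.4, 0.3, 0.2, 0.1])
d = rng.multinomial(1, P, size=n)    # one-hot purchase indicators d_ij

s = d.mean(axis=0)                   # market shares s_j = (1/n) sum_i d_ij

loglik_ind = float((d * np.log(P)).sum())        # sum_i sum_j d_ij log P(j|i)
loglik_prod = float(n * (s * np.log(P)).sum())   # n sum_j s_j log P(j)
```

Since each consumer contributes log P of the alternative they chose, grouping consumers by choice gives exactly n sj copies of log P (j), which is the simplification above.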

• Similarly to GMM, individual-level data allow us to either

1. estimate both θ and ξ as parameters (i.e. without distinguishing between the two in any particular way) with maximum likelihood, so

$$(\hat\theta, \hat\xi)_{ML} = \arg\max_{\theta, \xi} \log L^{ind}(\theta, \xi),$$

or, alternatively,

2. let ξ(θ) be defined as an implicit function of θ by (5.17), and estimate θ only by maximum likelihood:

$$\hat\theta_{ML} = \arg\max_{\theta} \log L^{ind}(\theta, \xi(\theta)).$$

• With product-level data, defining ξ by (5.16) causes the same problem in maximum likelihood estimation as with GMM: the model’s prediction P (j|θ, ξ(θ)) does not depend on the value of θ, since ξ adjusts to ensure that the model’s prediction always equals sj . That is, we get

$$\log L^{prod}(\theta, \xi(\theta)) = n\sum_{j=1}^{J} s_j \log P(j|\theta, \xi(\theta)) = n\sum_{j=1}^{J} s_j \log s_j$$

for all θ.

• Note that setting ξ = 0 is problematic under maximum likelihood estimation for the
same reason as discussed above for GMM: sampling error cannot explain why (5.16)
does not hold, and nothing else can either.

• However, since maximum likelihood estimation does not rely on explicitly defining an
error term like ωj in (5.14), there is a greater danger that this fact is overlooked.

• But in addition to the discrepancy between observed and predicted market shares, the maximum likelihood estimator

$$\hat\theta = \arg\max_{\theta} \log L^{prod}(\theta, \xi = 0)$$

is not consistent, because it does not take into account the fact that price is endogenous.

• Intuitively, with ξ = 0, there is nothing that can explain why some products with high
price also have high market shares, other than a low price sensitivity. Therefore α will
be (asymptotically) biased towards zero.

• The same would be true for other restrictions on ξ such as assuming that it is inde-
pendently distributed across j.

• Since the suggested GMM and maximum-likelihood estimators both fail for product-level data, an alternative approach is needed: GMM estimation of θ only, with moments

$$z_j\,\xi_j(\theta) \tag{5.20}$$

where ξ(θ) is the implicit function defined by (5.16).

• Note how this is again equivalent to the linear regression case. The equation (5.16)
serves only to say that the error terms ξj should ensure exact fit with the dependent
variable, just like the equation yi = xi θ + ei does for the linear regression model.

• In both cases, these definitions of the error term in themselves provide no information
about the value of θ, and additional restrictions are needed, in the form of orthogonality
assumptions between exogenous variables (instruments) and the residuals: E[zj ξj (θ)] =
0 for the structural demand model, and E[zi ei (θ)] = 0 for the linear regression model
(where ei (θ) = yi − xi θ).

5.6 Finding ξj

• Four main approaches are used to find ξ.

• The conceptually simplest, but in practice often the least convenient, alternative, is to
use some standard numerical algorithm to solve the system of equations given by the
equations (5.16) for each j. (For an example, see Goolsbee and Petrin (2004): The
Consumer Gains from Direct Broadcast Satellites and the Competition with Cable TV,
Econometrica.)

• If εij has a type-1 extreme value distribution, so the choice probabilities are given by (5.7) and (5.8), there is an analytical solution to (5.16):⁷

$$\xi_j = \log(s_j/s_0) - (x_j\beta - \alpha p_j).$$
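• The analytical inversion is worth checking once by round-tripping: generate shares from a known ξ, then recover it. The data and parameter values in this sketch are illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
J = 5

# Illustrative characteristics, prices and parameter values
x = rng.normal(size=J)
p = rng.uniform(1.0, 2.0, size=J)
beta, alpha = 1.0, 0.5
xi_true = rng.normal(scale=0.3, size=J)

# Forward: logit market shares from (5.7)-(5.8)
v = np.exp(x * beta - alpha * p + xi_true)
s0 = 1.0 / (1.0 + v.sum())           # outside-good share
s = v * s0                           # inside-good shares

# Invert analytically: xi_j = log(s_j / s_0) - (x_j beta - alpha p_j)
xi_recovered = np.log(s / s0) - (x * beta - alpha * p)
```

The recovered ξ matches the true ξ exactly, which is what makes the plain logit model so convenient: the implicit function ξ(θ) is available in closed form.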

• Another approach is slightly more involved than the previous one, but more widely applicable.

• It relies on the fact that for some commonly used discrete-choice models, for any value of θ, the function g : RJ → RJ defined as

$$g(\xi) = \xi + \log s - \log P(\theta, \xi), \tag{5.21}$$

where s = (s1 , . . . , sJ )′ and P (θ, ξ) = [P (1|θ, ξ), . . . , P (J|θ, ξ)]′ , is a so-called contraction (or contraction mapping). (This is proved in BLP (1995).)

• Any contraction g : RJ → RJ has two important features:⁸

1. g has a unique fixed point ξ ∗ , i.e. a point such that ξ ∗ = g(ξ ∗ ), and
2. For any starting point ξ 1 ∈ RJ , the sequence (ξ t ) defined by the recursive relation
ξ t+1 = g(ξ t ) converges to the fixed point ξ ∗ .

• If and only if ξ is a fixed point of g, log s − log P (θ, ξ) = g(ξ) − ξ = 0, or equivalently
s = P (θ, ξ), by (5.21). Therefore the fixed point is the unique solution to (5.16).

• The sequence gives us a procedure to find the solution.

• Define some small number τ , e.g. τ = 1e − 12.

• Starting from any ξ 1 ∈ RJ , let ξ 2 = g(ξ 1 ), etc.


7
Substituting:

P (j|θ, ξ) = (sj /s0 )/(1 + Σ_{j′=1}^{J} sj′ /s0 ) = sj /(s0 + Σ_{j′=1}^{J} sj′ ) = sj /1 = sj .

8
See e.g. Rudin: Principles of Mathematical Analysis, 3rd. ed., p. 220.


• Then there is an integer T such that t > T implies maxj |sj − P (j|θ, ξ t )| < τ , in other
words such that (5.16) is satisfied to whatever level of accuracy we desire.
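The fixed-point procedure above can be sketched in a few lines. A minimal illustration, assuming plain-logit choice probabilities so that the result can be checked against the analytic inversion in footnote 7 (the function names and example numbers are mine):

```python
import numpy as np

def logit_shares(xi, delta_bar):
    # P(j|theta, xi) for the plain logit model; delta_bar[j] = x_j beta - alpha p_j
    ev = np.exp(delta_bar + xi)
    return ev / (1.0 + ev.sum())

def invert_shares(s, delta_bar, tol=1e-12, max_iter=1000):
    # fixed-point iteration xi <- g(xi) = xi + log s - log P(theta, xi), eq. (5.21)
    xi = np.zeros_like(s)
    for _ in range(max_iter):
        xi_new = xi + np.log(s) - np.log(logit_shares(xi, delta_bar))
        if np.max(np.abs(xi_new - xi)) < tol:
            return xi_new
        xi = xi_new
    raise RuntimeError("contraction did not converge")

# three-product example with known xi
delta_bar = np.array([0.5, -0.2, 1.0])
xi_true = np.array([0.3, -0.4, 0.1])
s = logit_shares(xi_true, delta_bar)        # 'observed' market shares
xi_hat = invert_shares(s, delta_bar)
# analytic logit inversion (footnote 7): xi_j = log(s_j/s_0) - delta_bar_j
xi_analytic = np.log(s / (1.0 - s.sum())) - delta_bar
```

For the logit case the iteration is of course unnecessary, but the same `invert_shares` structure works for models where no analytic inversion exists.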

• The final approach is conceptually slightly different, because it does not involve finding
ξ(θ) for every value of θ attempted during the optimization of the GMM or likelihood
objective.

• Instead it recasts the problem as a minimization problem with respect to both θ and
ξ, subject to the constraint that (5.16) hold (at the solution), and relies on established
algorithms for solving such problems to solve
 
(θ̂, ξ̂) = arg min_{θ,ξ} F (θ, ξ) subject to s = P (θ, ξ),

where F is the GMM objective function or the negative of the log likelihood. (See
Dube, Fox and Su (2012), Econometrica.)
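The constrained formulation can be illustrated with a generic solver. A toy sketch, assuming a three-product logit market with a single linear parameter and using scipy's SLSQP for the equality-constrained problem (this is only a stand-in for the specialized large-scale solvers used in practice; all names and numbers are mine):

```python
import numpy as np
from scipy.optimize import minimize

# three-product logit market with one characteristic x and unobservables xi
x = np.array([1.0, 2.0, 3.0])
beta_true = 0.5
xi_true = np.array([0.2, -0.1, 0.3])

def shares(beta, xi):
    ev = np.exp(beta * x + xi)
    return ev / (1.0 + ev.sum())

s = shares(beta_true, xi_true)              # 'observed' market shares

# GMM objective in (beta, xi) with the single instrument z = x, W = 1
def F(v):
    m = x @ v[1:]                           # sample moment z'xi
    return m * m

# the share equations s = P(theta, xi) enter only as constraints
cons = {"type": "eq", "fun": lambda v: shares(v[0], v[1:]) - s}
res = minimize(F, np.zeros(4), constraints=[cons], method="SLSQP")
beta_hat, xi_hat = res.x[0], res.x[1:]
```

At the solution the constraint holds, so ξ̂ equals ξ(θ̂) from the implicit-function definition, and β̂ coincides with the GMM estimate one would get by first inverting the shares.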

5.7 Random coefficients

• As mentioned briefly in section 1.3, we are often interested in the extent to which
consumers switch their purchases to other firms in response to an increase in price.

• Given the model (5.3) with εi type-1 extreme value, the resulting choice probabilities
(5.7) — which for simplicity we now write as Pj = P (j|θ, ξ) — have price derivatives

∂Pj /∂pj = [exp(xj β − αpj + ξj )/(1 + Σ_{j′=1}^{J} exp(xj′ β − αpj′ + ξj′ ))](−α)
− {exp(xj β − αpj + ξj )/[1 + Σ_{j′=1}^{J} exp(xj′ β − αpj′ + ξj′ )]²} exp(xj β − αpj + ξj )(−α)
= −αPj (1 − Pj ) < 0,

and for k 6= j,

∂Pj /∂pk = −{exp(xj β − αpj + ξj )/[1 + Σ_{j′=1}^{J} exp(xj′ β − αpj′ + ξj′ )]²} exp(xk β − αpk + ξk )(−α)
= αPj Pk > 0.

• The cross-price derivatives show that in this model, substitution from j to k is pro-
portional to the market share of k.

• This follows from the fact that εi is i.i.d.: consumers who leave j after its price goes up
have the same probability of preferring k as any randomly chosen consumer. Therefore,
the share of these consumers who will switch to k is equal to Pk .

• This implication of the model is often unrealistic, because we would expect substitution
to be stronger between products that are similar.

• The probit model with estimated covariance matrix is one way of getting this effect,
but when J is large, it may not be feasible.

• An alternative is to relax the assumption that all consumers have the same coefficient
on xj and instead allow it to vary across consumers as indicated by writing βi .

• One way of doing this is to assume that for each consumer, βi is drawn from a multivari-
ate normal distribution, i.i.d. across consumers, with estimated mean β and standard
deviation given by the diagonal matrix σ, so that

βi = β + σνi (5.22)

where νi has a multivariate standard (i.e. zero mean and identity covariance matrix)
normal distribution, with p.d.f. denoted by φ.

• The choice probability is then


P (j|θ, ξ) = ∫∫ 1(j = arg max_k [xk (β + σνi ) − αpk + ξk + εik ]) φ(νi )f (εi )dεi dνi
= ∫ Pj (νi )φ(νi )dνi , (5.23)

where now θ = (α, β, σ) and

Pj (νi ) = exp[xj (β + σνi ) − αpj + ξj ]/(1 + Σ_{j′=1}^{J} exp[xj′ (β + σνi ) − αpj′ + ξj′ ]). (5.24)

• As discussed in the context of the probit model, the integral can be simulated as
Pj ≈ (1/R) Σ_{r=1}^{R} Pj (ν (r) ) (5.25)

where R is the number of vectors ν (r) drawn from the standard normal distribution.

• We then get:
∂Pj /∂pk ≈ (1/R) Σ_{r=1}^{R} ∂Pj (ν (r) )/∂pk = (1/R) Σ_{r=1}^{R} αPj (ν (r) )Pk (ν (r) ).
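The simulated probabilities (5.25) and the simulated cross-price derivative can be checked against a finite difference computed with the same draws. A small sketch (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
J, R = 4, 5000
x = rng.normal(size=J)                     # one characteristic per product
p = np.abs(rng.normal(size=J)) + 1.0       # prices
xi = rng.normal(scale=0.1, size=J)
alpha, beta, sigma = 1.0, 0.8, 0.5
nu = rng.normal(size=R)                    # draws for the random coefficient

def cond_probs(prices):
    # R x J matrix of conditional probabilities P_j(nu^(r)), eq. (5.24)
    v = x[None, :] * (beta + sigma * nu[:, None]) - alpha * prices[None, :] + xi[None, :]
    ev = np.exp(v)
    return ev / (1.0 + ev.sum(axis=1, keepdims=True))

pc = cond_probs(p)
P = pc.mean(axis=0)                        # simulated P_j, eq. (5.25)
# simulated cross-price derivative dP_0/dp_1 = (1/R) sum_r alpha P_0(nu) P_1(nu)
dP0_dp1 = alpha * (pc[:, 0] * pc[:, 1]).mean()
# check against a finite difference computed with the same draws
h = 1e-5
p_bump = p.copy()
p_bump[1] += h
fd = (cond_probs(p_bump)[:, 0].mean() - P[0]) / h
```

Using the same ν draws for both sides of the finite difference keeps the simulated objective smooth in prices, which is also what makes gradient-based optimization of the GMM objective feasible.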

• The random coefficient on xj causes products with similar product characteristics to
have a higher cross-price elasticity, because it creates a positive correlation between
their conditional choice probabilities Pj (νi ).

• To see this, first note that a different way of saying that two variables are positively
correlated is that they tend to be either both large or both small.

• For instance, suppose X and Y are two random variables with discrete support {0.1, 0.9}
and probability 0.5 for each outcome.

• If they are always both high or both low, their expected product is E(XY ) = 0.5(0.9)2 +
0.5(0.1)2 = 0.41.

• If they are never both high or both low, their expected product is E(XY ) = 0.5(0.9)(0.1)+
0.5(0.1)(0.9) = 0.09.

• In the same way, if two products j and k have similar product characteristics, xj σνi
and xk σνi will be positively correlated across different values of νi .

• In turn, Pj (νi ) and Pk (νi ) will be positively correlated, giving a high average product
Pj (νi )Pk (νi ) and therefore a high cross-price derivative.

• This explains why random coefficients on the product characteristics xj allow similar
products to have higher cross-price derivatives than products with different attributes.
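The correlation argument can be verified numerically: give two products identical characteristics and a third the opposite characteristic, and the conditional choice probabilities of the similar pair are positively correlated across draws of νi while those of the dissimilar pair are negatively correlated. A sketch (the setup is mine):

```python
import numpy as np

rng = np.random.default_rng(2)
R = 20000
x = np.array([1.0, 1.0, -1.0])    # products 1 and 2 similar, product 3 different
sigma = 1.0
nu = rng.normal(size=R)

v = x[None, :] * sigma * nu[:, None]               # random part of utility only
ev = np.exp(v)
pc = ev / (1.0 + ev.sum(axis=1, keepdims=True))    # R x 3 conditional P_j(nu)

cov_similar = np.cov(pc[:, 0], pc[:, 1])[0, 1]     # positive: move together
cov_different = np.cov(pc[:, 0], pc[:, 2])[0, 1]   # negative: move apart
# E[P_j P_k] exceeds E[P_j]E[P_k] for the similar pair, which raises the
# simulated cross-price derivative alpha * E[P_j(nu) P_k(nu)]
excess_similar = (pc[:, 0] * pc[:, 1]).mean() - pc[:, 0].mean() * pc[:, 1].mean()
```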

• Note that a similar effect can be obtained by interacting product characteristics with
observed consumer attributes zi , as in (5.2).

• However, even if such data are available, it is common to include random coefficients,
because it is unlikely that zi includes all determinants of tastes βi , in which case the
random coefficients capture remaining heterogeneity in consumers’ tastes for xj .

5.8 Practical issues in estimation

• Suppose we have product-level data, and want to estimate a random-coefficients logit
demand model using moments like (5.20).

• Given a weighting matrix W , we can use a numerical optimization algorithm to solve

min_θ [Σ_{j=1}^{J} zj ξj (θ)]′ W [Σ_{j=1}^{J} zj ξj (θ)]. (5.26)

• However, we can make our computations more efficient by treating the ‘fixed’ param-
eters (α, β) differently from those, σ, that multiply a quantity, νi , that varies across
consumers.

• Define

δj = xj β − αpj + ξj , (5.27)

i.e. the sum of all utility components that are fixed across consumers.

• So far we have used the equation (5.16) to implicitly define the function ξ(θ).

• But we can just as well define an implicit function δ(σ) with the system of J equations

sj = Pj = ∫ exp(xj σνi + δj )/(1 + Σ_{j′=1}^{J} exp(xj′ σνi + δj′ )) φ(νi )dνi . (5.28)

• Since δj absorbs xj β −αpj , α and β do not appear anywhere in this system of equations.

• Therefore the value of δ that solves the system of equations does not depend on α and
β, but only on σ.

• All the techniques discussed in section 5.6 for finding ξ also work for finding δ.

• The problem (5.26) can now be expressed as

min_{α,β,σ} [Σ_{j=1}^{J} zj (δj (σ) − xj β + αpj )]′ W [Σ_{j=1}^{J} zj (δj (σ) − xj β + αpj )].

• If we let X be a matrix that vertically stacks the vectors [xj , −pj ] for each observation
j, and Z does the same for zj , and write θ1 = σ and θ2 = (β 0 , α)0 , our estimator for
(θ̂1 , θ̂2 ) is

(θ̂1 , θ̂2 ) = arg min_{θ1 ,θ2 } [(δ(θ1 ) − Xθ2 )′ Z] W [Z ′ (δ(θ1 ) − Xθ2 )]. (5.29)

• For each value of θ1 , the optimal choice of θ2 , written as θ̃2 , will satisfy the first-order
condition

0 = ∂/∂θ2 [(δ(θ1 ) − X θ̃2 )′ Z] W [Z ′ (δ(θ1 ) − X θ̃2 )]
= −2X ′ ZW [Z ′ (δ(θ1 ) − X θ̃2 )]
X ′ ZW Z ′ X θ̃2 = X ′ ZW Z ′ δ(θ1 )
θ̃2 (θ1 ) = (X ′ ZW Z ′ X)−1 X ′ ZW Z ′ δ(θ1 ), (5.30)

where in the last line the notation reflects the fact that θ̃2 is a function of θ1 .

• Since we now have an analytical expression for θ̃2 in terms of θ1 , we can solve the
problem (5.29) by using a numerical optimization algorithm to find θ̂1 and then use
(5.30) to find θ̂2 :
θ̂1 = arg min_{θ1 } [(δ(θ1 ) − X θ̃2 (θ1 ))′ Z] W [Z ′ (δ(θ1 ) − X θ̃2 (θ1 ))] (5.31)

θ̂2 = θ̃2 (θ̂1 ). (5.32)

• Note that (5.30)-(5.32) and (5.29) define the same estimator. All we have done is to
derive an alternative calculation method, sometimes called concentrating out the linear
parameters.
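A compact implementation of (5.30), together with a numerical check that the concentrated value does minimize the GMM objective over θ2 (the data here are random placeholders for δ(θ1 ), X and Z):

```python
import numpy as np

def concentrate_theta2(delta, X, Z, W):
    # theta2_tilde = (X'ZWZ'X)^(-1) X'ZWZ' delta, eq. (5.30)
    A = X.T @ Z @ W @ Z.T
    return np.linalg.solve(A @ X, A @ delta)

def gmm_obj(delta, X, Z, W, theta2):
    # the objective in (5.29) as a function of theta2, holding delta fixed
    m = Z.T @ (delta - X @ theta2)
    return m @ W @ m

# random placeholders standing in for delta(theta1), X and Z
rng = np.random.default_rng(3)
n, k, l = 50, 3, 5
X = rng.normal(size=(n, k))
Z = rng.normal(size=(n, l))
W = np.eye(l)
delta = rng.normal(size=n)
t2 = concentrate_theta2(delta, X, Z, W)
```

Because the objective is quadratic in θ2 , the first-order condition is sufficient: any perturbation of `t2` raises the objective, which the test below verifies.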

• This analysis also applies to the special case where θ1 = 0, i.e. where there are
no random coefficients (or other coefficients on utility components that vary across
consumers).

• Since there is no θ1 to find by numerical minimization, (5.31) is no longer needed.

• As before, δ is the solution to the system of equations sj = Pj for all j.

• But now there are no longer any nonlinear parameters θ1 that affect δj , which is now
simply the solution to:

sj = exp(δj )/(1 + Σ_{j′=1}^{J} exp(δj′ )). (5.33)

• The expression (5.30) for the linear parameters θ2 is now modified slightly by the fact
that δ is not a function of any θ1 , so we get:

θ̂2 = (X 0 ZW Z 0 X)−1 X 0 ZW Z 0 δ, (5.34)

where (5.33) now has the straightforward solution

δj = log(sj /s0 ).

• Finally, it is useful to note that (5.34) is simply the GMM estimator for the linear
model

log(sj /s0 ) = xj β − αpj + ξj

with moment conditions E(zj ξj ) = 0, as can easily be seen from section 4.4, and (4.13)
in particular.
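Since (5.34) is just linear GMM, the logit case can be tested end to end on simulated data: generate log(sj /s0 ) with a price that is correlated with ξj , and recover (β, α) using a cost shifter as instrument. A sketch (the data-generating process is mine); OLS on the same data is biased, which illustrates why the instrument is needed:

```python
import numpy as np

rng = np.random.default_rng(4)
J = 100_000
xj = rng.normal(size=J)                 # exogenous characteristic
zj = rng.normal(size=J)                 # excluded cost shifter (instrument)
xi = rng.normal(size=J)                 # unobserved quality
pj = 1.0 + 0.5 * zj + 0.7 * xi + rng.normal(size=J)   # price correlated with xi
alpha_true, beta_true = 1.5, 1.0
delta = xj * beta_true - alpha_true * pj + xi          # = log(s_j / s_0)

X = np.column_stack([xj, -pj])          # regressors, so theta2 = (beta, alpha)
Z = np.column_stack([xj, zj])           # instruments
W = np.linalg.inv(Z.T @ Z / J)          # 2SLS weighting matrix
A = X.T @ Z @ W @ Z.T
beta_hat, alpha_hat = np.linalg.solve(A @ X, A @ delta)   # eq. (5.34)

# OLS for comparison: inconsistent because p is correlated with xi
beta_ols, alpha_ols = np.linalg.lstsq(X, delta, rcond=None)[0]
```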

5.9 The firm’s pricing problem

• In section 1.2 we saw how the prices set by a firm must satisfy first-order conditions
for profit maximization that depend on which products the firm owns.

• Defining quantity as the number of consumers times the choice probability, Qj = nPj ,
and noting that the market size n cancels, the first-order conditions (1.4) and (1.6)
for single-product and two-product firms, respectively, are

0 = Pj + (pj − mcj )∂Pj /∂pj , j = 1, 2 (5.35)
0 = Pj + (pj − mcj )∂Pj /∂pj + (pk − mck )∂Pk /∂pj , j = 1, 2; k ≠ j. (5.36)

• In a given market for differentiated products, let Ff be the set of products owned by
firm f , and let f (j) denote the firm that owns product j.

• Assuming J = 2 and single-product firms, (5.35) implicitly defines the best-response
function of firm f (j) for different values of the other firm’s price pk (which enters
through Pj and its derivatives).

• When (5.35) is satisfied for j = 1, 2, each firm’s price is its best response to the other
firm’s price — that is, the prices (p1 , p2 ) are a Nash equilibrium in a game where firms
choose prices to maximize profits.

• In principle this game could have more than one Nash equilibrium, but it is a standard
assumption (which does not appear to be violated in practice) that (5.35) has a unique
solution.

• Define the ownership matrix

Own = [ 1(f (1) = f (1))  1(f (2) = f (1)) ; 1(f (1) = f (2))  1(f (2) = f (2)) ],

whose (j, k) entry is an indicator function for whether product k is owned by the same
firm as product j.

• Every diagonal entry (j, j) is necessarily one, and the matrix is symmetric (i.e. the
entries (k, j) and (j, k) are equal).

• When J = 2, the ownership matrix must take one of two shapes, corresponding to two
single-product (“sp”) firms or one two-product (“tp”) firm, respectively:

Ownsp = [ 1 0 ; 0 1 ], Owntp = [ 1 1 ; 1 1 ].

• Next define vectors of prices, marginal cost and choice probabilities, as well as the
(Jacobian) matrix of partial derivatives with respect to prices:

p = [ p1 ; p2 ], c = [ mc1 ; mc2 ], P = [ P1 ; P2 ], ∇p P = [ ∂P1 /∂p1  ∂P1 /∂p2 ; ∂P2 /∂p1  ∂P2 /∂p2 ].

• Taking the element-by-element product (denoted .∗, as in Matlab notation) of the
ownership matrix and the transpose of the derivatives matrix gives

Ωsp = Ownsp . ∗ (∇p P )′ = [ ∂P1 /∂p1  0 ; 0  ∂P2 /∂p2 ]
Ωtp = Owntp . ∗ (∇p P )′ = [ ∂P1 /∂p1  ∂P2 /∂p1 ; ∂P1 /∂p2  ∂P2 /∂p2 ].

• The system of first-order conditions (5.35) and (5.36) can now be written on matrix
form as

0 = P + Ωsp (p − c) and
0 = P + Ωtp (p − c),

respectively.

• All these matrices (and vectors) extend to arbitrary numbers of products J and
ownership structures. For instance, if J = 4 and F1 = {1, 4} and F2 = {2, 3},

Own = [ 1 0 0 1 ; 0 1 1 0 ; 0 1 1 0 ; 1 0 0 1 ],

while F1 = {1, 3, 4} and F2 = {2} give

Own = [ 1 0 1 1 ; 0 1 0 0 ; 1 0 1 1 ; 1 0 1 1 ].

• The fully general matrix form of the first-order conditions is then

−Ω(p) · (p − c) = P (p), (5.37)

where the fact that both choice probabilities and their derivatives depend on prices
has now been reflected in the notation.

• We can easily solve the system for the markups or marginal cost implied by the observed
prices and estimated demand system (as given by P and ∇p P ) (under the assumption
that the observed prices satisfy the first-order conditions):

p − c = −Ω−1 (p) · P (p)
c = p + Ω−1 (p) · P (p).

• Solving for new optimal prices (for instance if Own changes) is more complicated, since
the system of equations (5.37) is nonlinear in p. But writing (5.37) in terms of prices
is still conceptually useful:

p = c − Ω−1 P.
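The markup calculation p − c = −Ω−1 P can be sketched directly, here for plain-logit derivatives with illustrative numbers; the single-product firm's markup reduces to 1/[α(1 − Pj )], which gives a convenient check:

```python
import numpy as np

def ownership_matrix(owner):
    # Own[j,k] = 1 if products j and k are owned by the same firm
    owner = np.asarray(owner)
    return (owner[:, None] == owner[None, :]).astype(float)

def logit_jacobian(P, alpha):
    # dP_j/dp_k for plain logit: -alpha P_j (1-P_j) on the diagonal,
    # alpha P_j P_k off the diagonal
    D = alpha * np.outer(P, P)
    np.fill_diagonal(D, -alpha * P * (1.0 - P))
    return D

alpha = 2.0
P = np.array([0.2, 0.3, 0.1])
owner = [1, 1, 2]                          # firm 1 owns products 0 and 1
Omega = ownership_matrix(owner) * logit_jacobian(P, alpha).T   # Own .* (grad P)'
markup = -np.linalg.solve(Omega, P)        # b = p - c = -Omega^{-1} P, from (5.37)
```

Internalizing the cross-price effects between products 0 and 1 raises their markups relative to what each would charge as a stand-alone firm.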

6 Berry, Levinsohn and Pakes (1995) [BLP]
6.1 Brief overview

• The paper proposes a feasible method for estimating the demand for differentiated
products:

– using product-level data only


– taking into account unobserved product characteristics that are correlated with
price (using IV for price, even though price enters the model nonlinearly)
– with random coefficients (to let substitution depend on the similarity of products).

• The contribution of the paper is mainly methodological, so we will focus on how they
estimate their model, and not on their results.

6.2 Data

• Unit of observation: car model, denoted j.

• 999 different car models in an unbalanced (not every model observed every year) panel
of 20 years (years denoted t) (1971-1990).9

• Total of 2217 model-year observations jt.

• For each model-year the data contain:

– The number of units sold qjt


– Price pjt (deflated by the CPI to 1983 dollars)
– A number of product characteristics such as horsepower, length etc. (see bottom
of p. 868 for a full list).
– The firm that owns j.

• From other data sources the paper uses the following information:

– Mean mt (for each of the 20 years) and standard deviation σ̂y (common to all
years) of household income.
– Number Mt of households in the US in each year of the data.
9
It is stated in the paper (p. 869) that there are 997 models, but the replication by Gentzkow and
Shapiro indicates that the actual number is 999. See https://www.brown.edu/Research/Shapiro/pdfs/blp
replication.pdf and https://www.brown.edu/Research/Shapiro/data/blp 1995 replication.zip.

• Define market shares
qjt
sjt = .
Mt
Let Jt be the set of all models sold in year t and define the ‘outside good’ j = 0 as the
option of not choosing any j in Jt , i.e. of not buying a new car. Then
q0t = Mt − Σ_{j∈Jt} qjt ,

or, dividing by Mt ,
s0t = 1 − Σ_{j∈Jt} sjt .

• Define the data vectors

x = [ 1  HP/Weight  Air  Miles/Dollar  Size ],
w = [ 1  ln(HP/Weight)  Air  ln(Miles/Gallon)  ln(Size)  Trend ],

where Air is an indicator for air conditioning, Size is width times length and Trend
is simply t, where we assume t = 1, . . . , 20.
• Let K1 be the length of x and K2 the length of w.
• For each jt, let xjt be the value of x, and let xjtk denote the k-th entry of x.

6.3 Demand model

• Consumer i at t gets (indirect) utility from choosing alternative j:


uijt = δjt − α pjt /yit + Σ_{k=1}^{K1} xjtk σk νik + εijt (6.1)
δjt = xjt β̄ + ξjt (6.2)
yit = exp(mt + σ̂y νiy ) (6.3)
ui0t = εi0t , (6.4)

where νi = (νiy , νi1 , . . . , νiK1 )0 is a random vector with a multivariate normal distribu-
tion with mean zero and identity covariance matrix, whose probability density function
we denote by φ. The shocks εijt are i.i.d. type-1 extreme value.10
10
These equations differ from those on p. 868 in the paper in two respects: (1) the paper has ln(yit − pjt ),
and (2) we have normalized the utility of j = 0 to be zero apart from the logit shock. The latter is a

• The specification of income yit comes from the assumption that household income has
a lognormal distribution in the population.

• This specification implies choice probabilities


Pjt = ∫ exp(δjt − α pjt /yit + Σ_{k=1}^{K1} xjtk σk νik ) / (1 + Σ_{j′∈Jt} exp(δj′t − α pj′t /yit + Σ_{k=1}^{K1} xj′tk σk νik )) φ(νi )dνi , (6.5)

which can be simulated as discussed above.

• The observed market shares sjt allow us to define δjt as an implicit function of the data
and the nonlinear parameters θ1 = (σ1 , . . . , σK1 , α):

sjt = Pjt [δt (θ1 ), θ1 ], for all j ∈ Jt

where the vector δt stacks all the δjt vertically.

• This is a system of equations whose number of unknowns and equations equals the
number of elements in the set Jt . There is one such system of equations for each year
of data.

• For each t, given a value of θ1 , we find δt (θ1 ) using the contraction method described
above.

• Once δjt (θ1 ) has been found, we have an expression for ξjt in terms of the data and
parameters:

ξjt = δjt (θ1 ) − xjt β̄. (6.6)

This defines the error term for the demand equation in BLP.

6.4 Supply model

• The discussion in section 5.9 shows that the demand function combined with product
ownership has implications for the relationship between price and marginal cost for
each product.

• Defining, for one market t at a time, the combined ownership / price-derivative matrix
Ωt as in section 5.9, the implied vector of markups bt = pt − mct is

bt = −(Ωt )−1 Pt

normalization. Regarding (1), income is in many cases smaller than price, in which case the log specification
is not defined. Again according to Gentzkow/Shapiro, it seems that the functional form pjt /yit was used in
practice in BLP’s code.

where bt , prices pt and Pt are all column vectors of length equal to the number of
elements in Jt , and Ωt is a square matrix of corresponding dimensions. Pt is the vector
of choice probabilities given by (6.5).

• Since bt = pt − mct , we can write the marginal cost implied by the (1) observed prices,
(2) observed product ownership, and (3) choice probabilities (and their derivatives) as:

mct = pt − bt .

• Let bjt denote the (scalar) entry of bt corresponding to model j in t.

• Marginal cost is assumed to be constant (in output quantity) and its log a linear
function of product characteristics:

ln(mcjt ) = wjt γ + ωjt . (6.7)

• Combining the implied markup bjt with this function for marginal cost, we get

ωjt = ln(pjt − bjt ) − wjt γ. (6.8)

This defines the error term for the supply equation in BLP.

6.5 Moments

• Suppose we have two row vectors of instruments z1jt and z2jt , of size 1 × L1 and 1 × L2 ,
respectively.

• Define the 2 × (L1 + L2 ) block-diagonal instrument matrix

Zjt = [ z1jt  0 ; 0  z2jt ].

• Also arrange the error terms ξjt and ωjt , defined in (6.6) and (6.8) respectively, in a
2 × 1 vector:

ujt = [ ξjt ; ωjt ].

• The paper estimates two equations using data with a panel structure (J products
observed in multiple years).

• Estimation is by GMM using the following (L1 + L2 ) × 1 two-equation panel moments
(see sections 4.5 and 4.6):
g(Wj , θ) = Σ_{t=1}^{T} Zjt′ ujt , (6.9)

where Wj gathers all the data relevant for observation j.11

• Note that J is the number of observations here, corresponding to n in section 4.5.

6.6 Discussion of GMM requirements

• The product characteristics xjt and wjt are assumed to be exogenous, so these variables
serve as instruments for themselves.

• Price pjt is clearly endogenous, since higher unobserved quality ξjt will make it optimal
for the firm to set a higher price.

• To form instruments for price, the paper relies on the reasonable claims that markups:

– depend on whether the product has many close substitutes or not, which in turn
depends on whether product characteristics are similar to those of competitors;
– “respond differently to own and rival products”.

• Since the markup is one constituent of price, product characteristics of the firm’s own
products and of other firms’ products will be correlated with price.

• Concretely, BLP form the instruments12

z1jt = [ xjt ,  Σ_{j′∈Ff (j)} xj′t ,  Σ_{j′∉Ff (j)} xj′t ]
z2jt = [ wjt ,  Σ_{j′∈Ff (j)} wj′t ,  Σ_{j′∉Ff (j)} wj′t ].

• These instruments must also serve to get enough (L ≥ K) moments to estimate the
σ-parameters.
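Constructing these instrument vectors is a simple bookkeeping exercise. A sketch for z1jt in one market (the function name is mine; note the own-firm sum here follows the notes' formula and includes product j itself, which some implementations exclude):

```python
import numpy as np

def blp_instruments(X, firm):
    # for each product: own x_j, sum of x over the same firm's products,
    # and sum of x over rival firms' products
    firm = np.asarray(firm)
    total = X.sum(axis=0)
    own_firm = np.vstack([X[firm == f].sum(axis=0) for f in firm])
    rivals = total[None, :] - own_firm
    return np.hstack([X, own_firm, rivals])

# 4 products, 2 characteristics; firm 1 owns products 0-1, firm 2 owns 2-3
X = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [3.0, 0.0],
              [4.0, 1.0]])
firm = [1, 1, 2, 2]
Z1 = blp_instruments(X, firm)
```

In real applications some columns come out collinear (see the Gentzkow and Shapiro readme) and have to be dropped.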

• Although it is not too clear from the discussion in the paper, it seems intuitive that the
moments involving these instruments will also be informative about the σ-parameters,
11
I use an upper case ‘w’ here to distinguish it from wjt .
12
In practice, some of these must be dropped because of collinearity. See the readme file of Gentzkow and
Shapiro’s replication.

which are closely related to the differences in substitution patterns between similar
products vs. non-similar products.13

• As is often the case with nonlinear models, the (implicit) claim that the model satisfies
requirement 4. for the GMM estimator relies on an intuitive argument that there is
a sufficient number of different sources of variation. See also the discussion in section
4.9.

• Requirement 3. for the GMM estimator is that the population mean of the moment is
zero (at the true parameter value).

• Let zt = (xjt , wjt )j∈Jt . BLP’s primitive identifying assumptions are the mean indepen-
dence conditions:

E[ξjt |zt ] = 0
E[ωjt |zt ] = 0.

• That is, it is assumed that neither the characteristics of product j itself, nor those of
other products, contain any information about the expected value of the unobserved
quality ξjt or unobserved cost shifter ωjt .

• The mean independence assumptions imply the orthogonality conditions

E[h1 (zt )ξjt ] = 0


E[h2 (zt )ωjt ] = 0,

where h1 and h2 are any functions of zt .

• It follows that if the instruments z1jt and z2jt are functions of zt , requirement 3. is
satisfied.

• Requirement 1. for the GMM estimator is that W1 , . . . , WJ are i.i.d., although weaker
assumptions may be sufficient, as mentioned in footnote 2.

• It is not obvious that independence holds, since the moment for each j explicitly
depends on the prices and characteristics of all other products j 0 sold in the same
markets, through the choice probability (6.5).

• The discussion in the paper related to this issue (mostly on p. 856) is not very trans-
parent, but at least independence across j seems to be assumed.14
13
For a recent, somewhat formal, discussion of what kind of instruments are needed in BLP-type models,
see Berry and Haile (2016): Identification in Differentiated Products Markets, Annual Review of Economics.
14
The large-sample properties of the estimator are discussed in more detail in Berry, Linton and Pakes

6.7 Practical issues

• The method for ‘concentrating out’ the linear parameters (see section 5.8) works in
the BLP setting and should be used.
• Still let θ1 = (σ1 , . . . , σK1 , α)′ and let θ2 = (β̄ ′ , γ ′ )′ .
• Define

yjt (θ1 ) = [ δjt (θ1 ) ; ln(pjt − bjt (θ1 )) ]

and the 2 × (K1 + K2 ) block-diagonal matrix

Xjt = [ xjt  0 ; 0  wjt ],

so that yjt = Xjt θ2 + ujt .

• Now let Y (θ1 ), X and Z be the vector and matrices that vertically stack yjt (θ1 ), Xjt
and Zjt .
• Based on the same reasoning as in section 5.8, we now get the following expression for
the BLP estimator:

θ̃2 (θ1 ) = (X ′ ZW Z ′ X)−1 X ′ ZW Z ′ Y (θ1 ) (6.10)

θ̂1 = arg min_{θ1 } [(Y (θ1 ) − X θ̃2 (θ1 ))′ Z] W [Z ′ (Y (θ1 ) − X θ̃2 (θ1 ))] (6.11)

θ̂2 = θ̃2 (θ̂1 ). (6.12)

• The practical steps of estimation are (for a given W ):

1. define a starting value of θ1 and call a numerical minimization algorithm
2. calculate Y (θ1 ) at the current trial value of θ1
3. calculate θ̃2 (θ1 )
4. use these to calculate the value of the GMM objective function in (6.11) at the
current trial value of θ1
5. let the numerical minimization algorithm choose a new trial value of θ1
6. repeat steps 2.-5. until the algorithm converges; the value of θ1 at convergence
is θ̂1 .
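Putting the pieces together, the loop in steps 1.-6. can be sketched for a stripped-down one-market version of the model (demand side only, one random coefficient, ξ set to zero in the data-generating process so that the objective is exactly zero at the true σ; all names and values are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
J, R = 8, 200
x = rng.normal(size=J)
p = np.abs(rng.normal(size=J)) + 1.0
nu = rng.normal(size=R)                   # the same draws are used throughout
beta_true, alpha_true, sigma_true = 1.0, 1.5, 0.8

def shares(delta, sigma):
    v = delta[None, :] + sigma * np.outer(nu, x)
    ev = np.exp(v)
    return (ev / (1.0 + ev.sum(axis=1, keepdims=True))).mean(axis=0)

delta_true = x * beta_true - alpha_true * p     # xi = 0 in this toy DGP
s = shares(delta_true, sigma_true)              # 'observed' market shares

def invert(sigma, tol=1e-12, max_iter=10_000):
    delta = np.log(s) - np.log(1.0 - s.sum())   # plain-logit starting point
    for _ in range(max_iter):
        new = delta + np.log(s) - np.log(shares(delta, sigma))
        if np.max(np.abs(new - delta)) < tol:
            break
        delta = new
    return new

X = np.column_stack([x, -p])
Z = np.column_stack([x, p, x ** 2])             # overidentifying instruments
W = np.eye(3)

def gmm_obj(sigma):                              # steps 2.-4. for one trial value
    delta = invert(sigma)
    A = X.T @ Z @ W @ Z.T
    theta2 = np.linalg.solve(A @ X, A @ delta)   # concentrated (beta, alpha)
    m = Z.T @ (delta - X @ theta2)
    return m @ W @ m

res = minimize_scalar(gmm_obj, bounds=(0.0, 2.0), method="bounded")  # steps 5.-6.
sigma_hat = res.x
```

The real BLP estimator differs in having many markets, a supply side, simulated income draws, and proper instruments, but the control flow (outer minimization over θ1 , inner contraction plus concentration of θ2 ) is the same.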

(2004) and in appendix 1 of the NBER version of BLP. In both places the Lyapunov Central Limit Theorem
is used to show that the asymptotic distribution of the sample mean of the moments is normal. This
theorem requires independence (but not identical distribution) (see Billingsley: Probability and Measure,
3.ed., Theorem 27.3).

7 Nevo (2001)
7.1 Brief overview

• Paper: Measuring Market Power in the Ready-to-Eat Cereal Industry, 2001, by Aviv
Nevo.

• The model and estimation is very similar to BLP, so other than being a very nice
exposition of the BLP approach, the paper’s main contributions are to use:

– a structural demand model to formally test whether firms are colluding, by com-
paring the markups implied by the demand estimates to cost data, and
– product fixed effects to relax the mean independence / orthogonality assumption
in BLP to hold only for the market-specific deviation in unobservable quality.

• The model/estimation in this paper is essentially the same (with only very minor
differences) as in Nevo’s other paper from around the same time, RAND Journal of
Economics, 2000: Mergers with differentiated products: the case of the ready-to-eat
cereal industry.

• Instead of testing conduct using implied markups, the merger paper calculates the
counterfactual price equilibrium under the assumption that the ownership matrix
changes to reflect proposed mergers. We have already discussed how this can be done.
See the paper for the results.

7.2 Data

• Main cereal data:

– 25 brands of breakfast cereal, denoted j. (Those with highest market share in the
last quarter of the data.)
– 20 quarters (1988:Q1 - 1992:Q4).
– 65 cities, increasing over time.
– There are 1124 city-quarter combinations, denoted by t.
– All brands except one are present in all 1124 city-quarter combinations. One brand
is present only in 1989:Q1 (unclear why this is included).
– Quantity variable, qjt is the total number of servings sold (kilograms sold divided
by serving size).
– Market size Mt is inhabitants × days in a quarter, i.e. one serving per capita per
day.

– Given qjt and Mt , market shares of inside and outside goods are defined as in
BLP.
– Price pjt is total value of sales of the brand at t (deflated by regional urban CPI)
divided by qjt . Price varies across cities as well as quarters.
– Vector of product characteristics

xj = [cal f rom f at, sugar, mushy, f iber, all, children, adult]

(where the last three variables are dummies for products geared to different con-
sumer segments) does not vary across t.
– Brand-specific advertising expenditure ajt (seems to vary across time).

• Distribution of demographic variables

– 40 individuals drawn in each city/year, recording

Di = [income, age, 1(age < 16)]

• Other data, for instrumental variables in some specifications, but not in the full model:

– Average wage in supermarket sector in each city


– city population density
– regional price index

7.3 Model

• The model is very similar to BLP, but some differences are:

– demand model is estimated on its own (without estimating a supply model)


– utility includes product-level fixed effects
– unobserved heterogeneity in random coefficients depends on distribution of de-
mographics Di as well as standard normal shock νi

• Let xjk be the k-th element of the vector xj and let K be the length of xj .

• Let πk be a row vector of parameters (specific to xjk ) of the same length as the column
vector Di .

• Some of the elements of πk are set to zero; check rows 3-5 in Table VI, p. 327 of the
paper to see which parameters are estimated.

• Consumer i at t gets (indirect) utility from choosing alternative j:
uijt = δjt + pjt (π0 Di + σ0 νi0 ) + Σ_{k=1}^{K} xjk (πk Di + σk νik ) + εijt (7.1)
δjt = dj + dq(t) + γajt − αpjt + ∆ξjt (7.2)
dj = xj β̄ + ξj (7.3)
ui0t = εi0t , (7.4)

where dj and dq(t) are dummy variables for the brand and quarter, where q(t) is the
quarter of city-quarter t (this effect is common to all cities in the quarter).
• Here νik is standard normal and εijt is type-1 extreme value.
• In (7.1), utility has been split into several parts, defined in (7.2-7.3), according to how
they are estimated.
• Choice probabilities are

Pjt = ∫∫ Pjt (Di , νi ) φ(νi )h(Di )dνi dDi , (7.5)

where

Pjt (Di , νi ) = exp[δjt + pjt (π0 Di + σ0 νi0 ) + Σ_{k=1}^{K} xjk (πk Di + σk νik )] / (1 + Σ_{j′=1}^{J} exp[δj′t + pj′t (π0 Di + σ0 νi0 ) + Σ_{k=1}^{K} xj′k (πk Di + σk νik )])

and h is the probability density function of the consumer attributes Di .


• The (multiple) integral in (7.5) is simulated by using R draws (D(r) , ν (r) ) and taking
the average of the conditional choice probabilities across the draws:
Pjt ≈ (1/R) Σ_{r=1}^{R} Pjt (D(r) , ν (r) ).

• It seems R = 40 in Nevo’s paper.


• For practical purposes, the parameters πk are estimated in the same way as the σk —
i.e. as nonlinear parameters, just like σk in BLP.
• As in BLP, we define δjt as an implicit function of the data and the nonlinear param-
eters, which are now θ1 = (π0 , . . . , πK , σ0 , . . . , σK ):

sjt = Pjt [δt (θ1 ), θ1 ], for all j

where the vector δt stacks all the δjt vertically.

• The linear parameters that are ‘concentrated out’ are θ2 = (d1 , . . . , dJ , dq1 , . . . , dq19 , γ, α).

• This means that the data vector X in (5.30) should now stack, for each observation
(j, t), a vector consisting of a full set of product dummies as well as −pjt .

7.4 Estimation

• Estimation is similar to in BLP, but with some differences:

– The error term that goes into the moments used in estimation is ∆ξjt , i.e. the
only city-quarter specific deviation in unobserved quality ξjt after controlling for
the average (across city-quarters) unobserved quality ξj through the brand fixed
effects dj .
– To decompose dj into the effects of xj and ξj , respectively, a second stage of
estimation is needed, where dj is regressed on xj and the residuals are now the
estimates of ξj .

• The moments used for the GMM estimation are

zjt ∆ξjt ,

where ∆ξjt = δjt (θ1 ) − [dj + dq(t) + γajt − αpjt ] and zjt is a vector of instruments.

• This means that Nevo assumes E[zjt ∆ξjt ] = 0, but does not need the BLP-type as-
sumption E[zjt ξjt ] = 0 for his main results.

• This seems like a big advantage, since unobserved quality ξjt is likely to be correlated
with product characteristics or other j-specific instruments, while the city-quarter
specific deviation ∆ξjt is much more likely to be uncorrelated with the instruments.

• Whereas BLP assumed independence across j but not t, Nevo (implicitly) assumes
independence across each jt-observation.

• Since BLP has 999 j-observations, they can rely on asymptotics in the j-dimension.

• Nevo on the other hand, has only 25 j-observations, so if he defined the moments like
BLP do, there would be only 25 observations of each moment.15
15
Alternatively, he could assume independence across brands and cities, but not across quarters within a
brand-city combination. This would give a large enough number of observations (65 · 25).

• The second-stage regression

dj = xj β̄ + ξj

could be estimated with OLS or the GLS (generalized least squares) estimator used in
the paper (see p. 322).

• In either case this requires the same assumption as in BLP: E(xj ξj ) = 0.

• However, the parameters β̄ are not needed for any of the analysis in the paper. They
would only be needed for questions that involve changing the value of xj . The stronger
orthogonality assumption therefore plays a very minor role in the results.

7.5 Instruments

• The instrument vector zjt contains brand and quarter dummy variables and advertising
expenditure.

• The remaining parameters (for which other instruments are needed) are (see Table VI,
p. 327):

– α (1 parameter)
– σ (9 parameters)
– π (10 parameters)

• Product characteristics do not vary across city-quarters (t), so all product character-
istics or functions of product characteristics can be written as linear combinations of
the brand dummy variables.

• Therefore, the BLP instruments do not provide additional restrictions beyond those
involving the brand dummies. (See p. 320 in the paper for a discussion.)

• Instead, Nevo uses the following instruments, for each jt:

– The quarterly average price of j in other cities in the same region as the city in
the city-quarter t. Since there are 20 quarters in the data, and each quarterly
average is included as a separate instrument, this gives 19 instruments when one
is dropped to avoid collinearity (with the full set of brand dummies).
– Proxies for marginal cost: census region dummy variables (9 regions minus 1,
presumably) (transport cost), city density (cost of space) and average city earn-
ings in the supermarket sector (labour cost). This gives 10 additional moment
conditions.
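The first set of instruments (average prices of the same brand in other cities of the same region) can be sketched as follows. The data here are simulated and the array layout is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
J, C, T = 5, 8, 4                             # brands, cities, quarters (toy sizes)
region = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # hypothetical region of each city
price = rng.uniform(2.0, 4.0, size=(J, C, T))

# For each (brand, city, quarter): average price of the same brand, same quarter,
# in the other cities of that city's region.
iv = np.empty_like(price)
for c in range(C):
    others = (region == region[c]) & (np.arange(C) != c)
    iv[:, c, :] = price[:, others, :].mean(axis=1)
```

The identifying assumption is then that `iv` is correlated with `price` through shared cost components but uncorrelated with the city-specific demand shock.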

• The prices of other cities are correlated with price because of shared components of
marginal cost, but uncorrelated with the demand shock ∆ξjt because these shocks are
assumed to be independent across cities.

• The proxies for marginal cost are assumed to be exogenous.

• Both sets of instruments should help estimate the price parameter (if we accept
the exogeneity assumptions).

• The paper does not discuss how the instruments contribute to estimating the σ and π
parameters, but, as in BLP, it seems plausible that instruments that shift price,
and therefore market shares, provide information about the parameters (σ and π)
that determine substitution patterns. (See also footnote 13.)

7.6 Using implied markups to determine conduct

• The end goal of the paper is to determine whether firms act as if they maximize profit
(a) of brands individually, (b) of the firm’s portfolio of brands, or (c) jointly of all
brands in the market (act as a cartel).

• Using

– the asymptotic normal distribution of the estimator (with estimated mean and
variance), and
– the fact that each hypothesis about conduct and each value of the parameter
vector imply a markup ((p − mc)/p) for each brand,

Nevo calculates 95-percent confidence intervals for the median (across brands and city-
quarters) markup.

• Concretely, the procedure is:

1. Take R = 1500 (this is the number used in the paper) draws ν^r from a multivariate
normal distribution with mean zero and identity covariance matrix, of dimension
equal to the number of estimated parameters (only the ones from the first stage
of estimation are needed).
2. Transform the standard normal draws using the estimated mean (parameter estimates)
and covariance matrix of the estimator to get θ̂^r = θ̂ + Aν^r (where AA′ equals
the covariance of the estimator, so A can be taken to be its Cholesky factor).
3. For each θ̂^r calculate the implied percentage markup for each jt under each of the
three hypotheses (a)-(c), and take the median across all jt.
4. Repeat steps 2. and 3. for all the R draws.

5. Find the 2.5-th and 97.5-th percentiles of the median markup across the R draws,
for each of the three hypotheses. These are the bounds of the confidence intervals.
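The steps above can be sketched in Python. The markup function below is a placeholder: in the paper it comes from the demand estimates and the first-order conditions under each conduct hypothesis, so all numbers here are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first-stage estimates (placeholders, not from the paper).
K = 20                                      # number of first-stage parameters
theta_hat = rng.normal(size=K)              # parameter estimates
M = rng.normal(size=(K, K))
cov_hat = M @ M.T / K                       # positive-definite covariance estimate

W = rng.normal(size=(50, K)) / np.sqrt(K)   # maps parameters to 50 "jt" markups

def markup(theta, hypothesis):
    """Placeholder for the implied percentage markups (p - mc)/p for each jt."""
    base = {"a": 0.25, "b": 0.35, "c": 0.55}[hypothesis]
    return base + 0.01 * np.tanh(W @ theta)

R = 1500
A = np.linalg.cholesky(cov_hat)             # so that A @ A.T = cov_hat
intervals = {}
for h in ("a", "b", "c"):
    medians = np.empty(R)
    for r in range(R):
        theta_r = theta_hat + A @ rng.normal(size=K)    # step 2
        medians[r] = np.median(markup(theta_r, h))      # step 3
    intervals[h] = np.percentile(medians, [2.5, 97.5])  # step 5

# Test each hypothesis: reject if an external estimate m of the median markup
# falls outside the interval (decisions here are arbitrary, given the placeholders).
m = 0.385
decision = {h: ("reject" if not (lo <= m <= hi) else "fail to reject")
            for h, (lo, hi) in intervals.items()}
```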

• For each hypothesis i = a, b, c, let [li , ui ] denote the resulting 95-percent confidence
interval.

• Then if hypothesis i is true, and m is the true population median markup (assuming
the demand model is correct and the parameters consistently estimated):

Pr(m ∈ [li , ui ] | hypothesis i is true) = 0.95.

• Using direct data on cost (not implied by the demand estimates), Nevo claims that
the percentage markup of a typical firm is between a low estimate of 31% and a high
estimate of 46%. These numbers should be regarded as approximate values for the
true markup m.

• For each i the confidence interval then gives us a test of the null hypothesis that
hypothesis i is true.

• Under the null hypothesis, the relevant confidence interval covers m with probability
0.95, so we will reject the null hypothesis if m is not in [li , ui ].

• Using this method, Nevo can reject hypothesis (c), of joint profit maximization, but
not the other two.

• He then concludes that firms in this industry do not appear to collude, but that their
market power is due to product differentiation (probably to a large extent driven by
marketing).
