
Combining Latent-Class Choice, CART and CBC/HB to Identify Significant Covariates
in Model Estimation
George Boomer, StatWizards LLC
Kiley Austin-Young, Comcast Corp.
Abstract
Hierarchical Bayes estimation has proven to be highly accurate in forecasting
holdout samples but has a difficult time allowing for covariates. Attempts to
manipulate the upper level of Bayesian estimation show promise, but so far efforts
to identify key covariates (or more specifically, variables that describe respondents)
have met with mixed results. In the real world we know that covariates are often
important, for example, gender in the handbag market, income in the exotic car
market, age in the market for geriatric medicine. How, then, can we identify key
covariates and incorporate them into a CBC simulation?
Our approach makes use of three techniques applied to a common data set. First,
CBC/HB is employed to produce a set of individual-level utilities. Second, a
latent-class choice (LGC) estimation identifies groups of respondents who share a common
set of utilities. Third, CART is used to improve upon LGC's covariate classification.
Finally, the latent classes and significant covariates from modern data mining
techniques are brought together in a common market simulator. We use both a
simulated data set and a disguised, real-world example from the
telecommunications industry to illustrate this approach.
This paper is not an attempt to use the covariates in the CBC/HB upper model, nor
is it a direct comparison of the above methods. Rather, it is an attempt to show an
alternative approach for identifying significant covariates in a choice-modeling
exercise.

1. The Problem
Hierarchical Bayes (CBC/HB) combines individual-level estimates of utility with
excellent fit of holdout samples. However, it struggles to accommodate covariates,
or more generally, variables that describe individuals rather than products. CBC/HB
allows covariates to be entered in the upper-level model, but no significance tests
are available for assessing whether covariates improve model fit.

To illustrate the problem, we borrowed a hypothetical data set for a fictional shoe
market from Statistical Innovations, Inc. The product in this data set contains three
attributes (fashion, quality and price) and four covariates (age, gender, eye color
and hair color). Each variable has a number of levels, as shown in the following
table:

Figure 1. List of attributes and covariates.

Using an experimental design, we estimated a CBC/HB model using the attributes
on the left. The result was a standard set of individual-level utilities, which we read
into Excel.

Figure 2. CBC/HB utilities file imported to Excel

To this file we appended a separate worksheet containing dummy-coded covariates
for each respondent.

Figure 3. Covariates appended to utilities file
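The dummy coding and append step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the respondent ids, utility column names, covariate levels and the helper append_dummies are hypothetical, since the paper performs this step in Excel and does not list the actual column layout.

```python
# Hypothetical CBC/HB utilities rows (respondent id plus part-worths); not real output.
utilities = [
    {"id": 1, "fashion_stylish": 0.82, "price_high": -1.10},
    {"id": 2, "fashion_stylish": -0.35, "price_high": -0.40},
]

# Hypothetical covariate rows keyed by the same respondent id.
covariates = [
    {"id": 1, "gender": "female", "eyes": "blue"},
    {"id": 2, "gender": "male", "eyes": "brown"},
]


def append_dummies(utilities, covariates, levels_by_variable):
    """Return utilities rows with one 0/1 column per covariate level appended."""
    by_id = {row["id"]: row for row in covariates}
    combined = []
    for row in utilities:
        person = by_id[row["id"]]          # match on respondent id
        out = dict(row)                    # copy so the utilities stay untouched
        for variable, levels in levels_by_variable.items():
            for level in levels:
                out[f"{variable}_{level}"] = int(person[variable] == level)
        combined.append(out)
    return combined


combined = append_dummies(
    utilities,
    covariates,
    {"gender": ["female", "male"], "eyes": ["blue", "brown"]},
)
for row in combined:
    print(row)
```

Because the utilities are copied rather than modified, the CBC/HB estimates themselves are unchanged by the append.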

Using this combined file, we built a simulator. Like many simulators, this one
supports analysis by subgroups. Starting with an arbitrary scenario in Figure 4,

Figure 4. Excel-based simulator with all covariates

we took a snapshot of this scenario and filtered the data to show only respondents
under 25 years old.

Figure 5. Filtering on respondents age 25 and under

Cell I19 shows that young respondents have a stronger preference for Stylish shoes
(product B): their share is more than 11 percentage points higher than in the base
scenario.

Resetting the filter, we now select people with blue eyes by changing the value in
cell B31.

Figure 6. Filtering on respondents with blue eyes

We see that blue-eyed people show a slightly increased (2.1 percentage points)
preference for product C, a basic shoe.
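The subgroup comparison performed by the simulator can be sketched as follows. The paper does not state which share rule its Excel simulator uses, so the logit (share-of-preference) rule here is an assumption, and the respondents, level names and products are hypothetical.

```python
import math


def logit_shares(respondents, products):
    """Average logit shares of preference over a list of respondents."""
    if not respondents:
        raise ValueError("no respondents match the filter")
    totals = [0.0] * len(products)
    for person in respondents:
        # Total utility of a product = sum of the part-worths of its levels.
        utils = [sum(person["utils"][level] for level in product)
                 for product in products]
        top = max(utils)                       # subtract max for numerical stability
        weights = [math.exp(u - top) for u in utils]
        denom = sum(weights)
        for i, w in enumerate(weights):
            totals[i] += w / denom
    return [t / len(respondents) for t in totals]


# Hypothetical individual-level utilities with one covariate flag each.
respondents = [
    {"under_25": True,
     "utils": {"stylish": 1.0, "basic": 0.0, "low_price": 0.5, "high_price": -0.5}},
    {"under_25": False,
     "utils": {"stylish": -1.0, "basic": 0.0, "low_price": 0.5, "high_price": -0.5}},
]
products = [["basic", "low_price"], ["stylish", "high_price"]]

base = logit_shares(respondents, products)
young = logit_shares([p for p in respondents if p["under_25"]], products)

# Difference between filtered and base shares, in percentage points.
lift_points = [100 * (y - b) for y, b in zip(young, base)]
print(base, young, lift_points)
```

A difference computed this way is purely descriptive; nothing in the calculation says whether it is larger than sampling noise, which is the gap the rest of the paper addresses.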
The question is whether either of these scenarios reveals a significant difference in
preferences between filtered and unfiltered groups. CBC/HB by its very nature
cannot tell us, because it does not support significance tests of any kind, yet we
know the technique's hit rates with holdout samples are consistently good. We'd
like to use CBC/HB, but we also want to know which covariates show significantly
different preferences for products in this market. How can we combine the two
objectives?

2. A Solution
We propose the use of alternate methodologies that permit significance tests for
covariates. Here's how the process works:

1. Estimate a model using the same dataset but an alternate methodology such
as Latent GOLD Choice (LGC).
2. Include all subgroups as covariates.
3. Perform significance tests on covariates.
4. Eliminate insignificant covariates.
5. Append significant covariates to the CBC/HB utilities file.
6. Optional, if LGC is used: Append segments from the LGC model.
Because results from CBC/HB are not affected, there is no confounding from using
two techniques. We are just identifying which covariates to append to CBC/HB
utilities.
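Steps 3 through 6 of the process above amount to a filter followed by an append. A minimal sketch, assuming the p-values have already been produced by the alternate methodology: the p-values, the 0.05 threshold, the respondent records and the function names are all hypothetical and chosen only for illustration.

```python
ALPHA = 0.05  # assumed significance threshold; the paper does not fix one here

# Hypothetical p-values from the alternate model, one per covariate.
p_values = {"age": 0.001, "gender": 0.004, "eye_color": 0.41, "hair_color": 0.63}

# Hypothetical CBC/HB utilities, covariates and LGC segments keyed by respondent id.
utilities = {101: {"fashion": 0.8, "quality": 0.3, "price": -1.1},
             102: {"fashion": -0.2, "quality": 0.9, "price": -0.4}}
covariates = {101: {"age": "under 25", "gender": "female",
                    "eye_color": "blue", "hair_color": "brown"},
              102: {"age": "25 and over", "gender": "male",
                    "eye_color": "brown", "hair_color": "black"}}
segments = {101: 2, 102: 1}


def significant(p_values, alpha=ALPHA):
    """Steps 3-4: keep only covariates whose p-value is below alpha."""
    return [name for name, p in p_values.items() if p < alpha]


def build_simulator_rows(utilities, covariates, keep, segments=None):
    """Steps 5-6: append kept covariates (and optionally segments) to utilities."""
    rows = {}
    for rid, utils in utilities.items():
        row = dict(utils)                                  # utilities are not altered
        row.update({name: covariates[rid][name] for name in keep})
        if segments is not None:
            row["segment"] = segments[rid]
        rows[rid] = row
    return rows


keep = significant(p_values)
rows = build_simulator_rows(utilities, covariates, keep, segments)
print(keep)
print(rows[101])
```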
Running an LGC model using the same data set and all covariates, we obtain the
following results for a three-segment model:

Figure 7. Latent GOLD Choice estimation with all covariates (annotations: sex and
age pass significance tests; eye color and hair color do not)

As the p-values in column F reveal, sex and age pass Wald tests of significance,
whereas eye color and hair color fail. The presumption of independence of
exogenous variables is not violated, as Latent GOLD employs covariates in separate
logit models to predict membership in segments.
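For readers unfamiliar with the Wald test, the following sketch shows the arithmetic for a single coefficient. It is a simplification: Latent GOLD tests all parameters of a covariate jointly, whereas this example handles one coefficient with one degree of freedom, and the estimates and standard errors are hypothetical rather than taken from Figure 7.

```python
import math


def wald_test(estimate, std_error):
    """Wald statistic and p-value for H0: coefficient = 0 (1 degree of freedom)."""
    wald = (estimate / std_error) ** 2
    # For a chi-square variable with 1 df, P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


# Hypothetical coefficients from a segment-membership logit model.
wald_gender, p_gender = wald_test(estimate=0.90, std_error=0.25)
wald_eyes, p_eyes = wald_test(estimate=0.10, std_error=0.20)

print(f"gender:    Wald = {wald_gender:.2f}, p = {p_gender:.4f}")
print(f"eye color: Wald = {wald_eyes:.2f}, p = {p_eyes:.4f}")
```

With these made-up inputs the first covariate would be retained at the 5% level and the second dropped, mirroring the pattern reported above.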

Turning for a moment to the model's attributes, we find that all of these coefficients
pass Wald tests for significance, but the hypothesis that price coefficients are equal
across segments cannot be rejected.

Figure 8. Wald tests on attributes (annotations: all attributes are significant, but
no significant class differences exist among price coefficients)

We revise our model to eliminate insignificant covariates and impose class
independence on the price coefficient. In the revised model, all significance tests
pass. To the spreadsheet containing our original CBC/HB utilities, we append the
significant covariates only.

Figure 9. LGC model with revised specification
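The test of whether price coefficients differ across segments can be illustrated with a simplified homogeneity statistic. This sketch is not Latent GOLD's computation: it assumes the three class-specific estimates are independent (ignoring their covariances), it is restricted to exactly three classes so that the chi-square p-value has a closed form, and the coefficients and standard errors are hypothetical.

```python
import math


def equality_test_three_classes(coefs, std_errors):
    """Test H0: one coefficient is equal across three classes.

    Assumes independent estimates. The statistic is the precision-weighted
    sum of squared deviations from the pooled estimate, compared with a
    chi-square distribution with 3 - 1 = 2 degrees of freedom.
    """
    if len(coefs) != 3 or len(std_errors) != 3:
        raise ValueError("this sketch handles exactly three classes")
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, coefs)) / sum(weights)
    stat = sum(w * (b - pooled) ** 2 for w, b in zip(weights, coefs))
    p_value = math.exp(-stat / 2.0)   # chi-square survival function for 2 df
    return stat, p_value


# Hypothetical price coefficients that are close across segments.
stat_close, p_close = equality_test_three_classes([-1.20, -1.05, -1.30],
                                                  [0.20, 0.25, 0.30])
# Hypothetical price coefficients that clearly differ.
stat_far, p_far = equality_test_three_classes([-1.0, -1.0, -3.0],
                                              [0.1, 0.1, 0.1])
print(p_close, p_far)
```

When the p-value is large, as in the first case, equality cannot be rejected and a single class-independent price coefficient is a reasonable restriction.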

Figure 10. CBC/HB utilities file with significant covariates appended

Employing LGC as an alternative methodology yielded a bonus: classification of
respondents into segments. As an option, we can append these classifications to
the same file.

Figure 11. CBC/HB utilities file with latent classes appended

We can now use this file to build a new simulator, this time including only covariates
that matter along with the option to filter on latent-class segments.

Figure 12. Revised simulator (annotations: the irrelevant variables eye color and
hair color are gone; segmentation variables were added)

3. Alternate methodologies
You don't have to use LGC as an alternate methodology; any methodology that can
identify significant covariates in the source data will do. Because it is closely related
to hierarchical Bayes [1], mixed logit serves this purpose well and would be our second
choice. Software packages for estimating mixed-logit models are available from a
number of sources, including:

Limdep's NLOGIT [2]
R library mlogit [3]
Michel Bierlaire's Biogeme [4]

Unlike the second and third programs in this list, NLOGIT produces individual-level
utilities, much like CBC/HB.

[1] Train, Kenneth (2001). "A Comparison of Hierarchical Bayes and Maximum
Simulated Likelihood for Mixed Logit", Department of Economics, University of
California, Berkeley.
[2] Available for license from Econometric Software at http://www.limdep.com/.

Tree-based methods such as SI-CHAID [5], CART, random forests and stochastic
gradient boosting [6] can also be used to identify covariates. All of these methods
begin with a dependent variable and a set of independent variables. The
dependent variable usually consists of discrete categories. Approaches employed
by these methods vary, but all tree methods identify cutpoints within key
independent variables and use them to create splits, such that the grouping of the
data after splitting becomes more concentrated around one of the dependent
variable's discrete categories. The process continues with additional branches (and
in some cases additional trees) being built until the final nodes are as concentrated
as possible.
In the course of building trees, each method identifies the most important
independent variables that contribute to splits. We can identify these variables and
append them to the CBC/HB utilities file.
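The idea of choosing cutpoints that make the resulting groups more concentrated can be sketched with Gini impurity, the criterion commonly associated with CART. This is a minimal illustration of a single split and a simple importance ranking, not the algorithm of any of the packages named above; the ages, eye-color codes and segment labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 when all labels are the same category."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (cutpoint, impurity reduction) for the best 'value <= cut' split."""
    parent = gini(labels)
    n = len(labels)
    best_cut, best_gain = None, 0.0
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2                      # midpoint between adjacent values
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain


# Hypothetical data: segment membership is driven by age, not by eye color.
segment = [1, 1, 1, 2, 2, 2]
candidates = {
    "age": [19, 22, 24, 31, 45, 52],
    "blue_eyes": [0, 1, 0, 1, 0, 1],
}

# Rank covariates by the impurity reduction of their best split.
gains = {name: best_split(values, segment) for name, values in candidates.items()}
ranking = sorted(gains, key=lambda name: gains[name][1], reverse=True)
print(gains)
print(ranking)
```

Full tree methods repeat this search within each resulting node and aggregate the reductions per variable, which is how they arrive at a list of important variables.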

4. Another use for tree methods


If you choose LGC as your alternate methodology, you have the option of appending
latent-class (i.e., segment) memberships to your CBC/HB utilities file along with
chosen covariates. If you do this, you can employ the same tree methods described
above to assign out-of-sample subjects to latent classes. Returning to our example,
here is such a tree built by CART.

[3] Can be downloaded from http://cran.r-project.org/web/packages/mlogit/index.html.
[4] Free download available at http://biogeme.epfl.ch/.
[5] SI-CHAID is a product from Statistical Innovations, Inc. For more information, see
http://statisticalinnovations.com/products/sichaid.html.
[6] Salford Systems provides software for all three tree methods. The company uses
the trademark TreeNet for its stochastic gradient boosting software. For more
information, see http://www.salford-systems.com/products.

Figure 13. CART tree used to predict segment membership

Covering this chart in detail lies beyond the scope of this paper, but suffice it to say
that classes 1, 2 and 3 correspond to segments 1, 2 and 3 in Figure 7.
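To make the out-of-sample assignment concrete, here is a small classification tree grown on covariates with latent-class labels as the target. It is a simplified Gini-based tree without pruning, written only to show the mechanics; it is not Salford Systems' CART, and the respondents, covariates and segment labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a binary tree on numeric covariates; leaves hold the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return {"leaf": majority}
    n = len(labels)
    best = None
    for var in rows[0]:
        distinct = sorted({row[var] for row in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            cut = (lo + hi) / 2
            left = [i for i in range(n) if rows[i][var] <= cut]
            right = [i for i in range(n) if rows[i][var] > cut]
            # Weighted impurity of the two child nodes; lower is better.
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / n
            if best is None or score < best[0]:
                best = (score, var, cut, left, right)
    if best is None or best[0] >= gini(labels):
        return {"leaf": majority}               # no split improves concentration
    _, var, cut, left, right = best
    return {
        "var": var, "cut": cut,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return that leaf's class."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]


# Hypothetical in-sample respondents: covariates plus their latent-class segment.
rows = [{"age": 20, "female": 1}, {"age": 23, "female": 0}, {"age": 24, "female": 1},
        {"age": 40, "female": 1}, {"age": 45, "female": 1}, {"age": 50, "female": 1},
        {"age": 42, "female": 0}, {"age": 48, "female": 0}, {"age": 55, "female": 0}]
segments = [1, 1, 1, 2, 2, 2, 3, 3, 3]

tree = build_tree(rows, segments)

# Assign hypothetical out-of-sample subjects to latent classes.
new_subjects = [{"age": 22, "female": 0}, {"age": 47, "female": 1}, {"age": 60, "female": 0}]
assigned = [predict(tree, subject) for subject in new_subjects]
print(assigned)
```

Only covariates are needed at prediction time, which is what allows subjects who never completed the choice exercise to be placed in a segment.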
With all of these tree-based methods, a question arises about which to use. The
answer depends partly on the pricing and availability of software and partly on
predictive accuracy. Regarding the latter criterion, we applied various tree-based
methods to our LGC model with the following results:

Figure 14. Comparison of predictions using selected tree-based methods (bar chart of
overall predictive accuracy, axis from 0% to 80%; recoverable bar labels: Latent GOLD
covariate classification, TreeNet)

In this example, TreeNet scored best, followed closely by Latent GOLD Choice's
internal covariate classification algorithm, though the differences between Latent GOLD,
CART and TreeNet are not statistically significant [7]. This finding is not surprising
given TreeNet's consistent performance in a number of data mining competitions [8].
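A proportions test of the kind used for this comparison can be sketched as a pooled two-proportion z-test. The hit counts below are hypothetical, not the results behind Figure 14, and the sketch treats the two accuracy rates as coming from independent samples; if both methods are scored on the same respondents, a paired test may be more appropriate.

```python
import math


def two_proportion_test(hits_a, n_a, hits_b, n_b):
    """Pooled two-sided z-test for the difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts of correctly classified respondents for two methods.
z, p_value = two_proportion_test(hits_a=150, n_a=200, hits_b=140, n_b=200)
print(f"z = {z:.3f}, p = {p_value:.3f}")
print("significant at the 5% level" if p_value < 0.05 else "not significant at the 5% level")
```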
Our approach appears to work well on this synthetic data set, but how does it work
in practice?

5. Comcast example
Comcast's core services comprise a portfolio of voice, video, and data packages.
The company's product pricing, packaging, and planning team was asked to
consider different approaches to product pricing as well as the lineup of the package
components and package constructs.

[7] Using a proportions test at the 95% confidence level.
[8] As one example, a TreeNet model placed first in the Duke/NCR Teradata Center for
CRM Competition held in 2003.

As part of this effort, Comcast employed a multi-product choice model to assess the
value of different cable channels based on customer viewing habits and the
potential upside from introducing new packaging portfolio approaches in market,
for example, video-inclusive multi-product packages focused less on traditional
product levers such as TV channels and more on a full suite of potential services.

Packages anchored by services other than video, such as high-speed data (HSD), and
supplemented by a rich set of emerging and differentiated services, including home
security/control and IP-based solutions such as storage, were tested in order to
frame an actionable recommendation for content packaging efforts. The choice
model included the following attributes and covariates:

With this set of attributes we estimated an HB model and generated a CBC/HB
utilities file.

Figure 15. CBC/HB utilities from Comcast model

Next, using the same data set but incorporating covariates, we estimated an LGC
model.

Figure 16. LGC model based on Comcast data

The model allowed us to apply significance tests to isolate important covariates. In
this case, personal covariates (age, gender and role in decision making) failed
significance tests, whereas covariates describing content providers passed. We
therefore selected provider variables to append to the CBC/HB utilities file.

Figure 17. Significance tests on covariates (rows shown: TV provider, ISP, voice
provider)

We appended the provider variables to the CBC/HB utilities file,

Figure 18. Significant covariates appended to utilities file

and we used the combined file to construct an Excel-based simulator.

Figure 19. Comcast simulator with covariates as filters

6. Summary
In most marketing situations, covariates matter. All other things being equal,
younger people express greater demand for technology products than older people.
Women have greater preference than men for manicures, and the list goes on. In
modeling choices, it's important to identify which of many possible covariates are
the ones associated with demand for a product. We can do that directly in Latent
GOLD Choice or mixed logit models, but CBC/HB, arguably the most popular choice
software today, does not easily incorporate covariates and has no way to test them.
In those situations where CBC/HB is the technique of choice, one can employ
alternate methodologies to select important variables to include in a market
simulator.