A Data-Driven Network Analysis Approach To Predicting Customer Choice Sets For Choice Modeling in Engineering Design

Article in Journal of Mechanical Design · July 2015

DOI: 10.1115/1.4030160

Mingxian Wang
Department of Mechanical Engineering, Northwestern University, Evanston, IL 60208
e-mail: mingxianwang2016@u.northwestern.edu

Wei Chen¹
Wilson-Cook Professor in Engineering Design
Department of Mechanical Engineering, Northwestern University, Evanston, IL 60208
e-mail: weichen@northwestern.edu

In this paper, we propose a data-driven network analysis based approach to predict individual choice sets for customer choice modeling in engineering design. We apply data analytics to mine existing data of customer choice sets, which is then used to predict choice sets for individual customers in a new choice modeling scenario where choice set information is lacking. Product association network is constructed to identify product communities based on existing data of customer choice sets, where links between products reflect the proximity or similarity of two products in customers’ perceptual space. To account for customer heterogeneity, customers are classified into clusters (segments) based on their profile attributes and for each cluster the product consideration frequency is computed. For predicting choice sets in a new choice modeling scenario, a probabilistic sampling approach is proposed to integrate product associations, customer segments, and the link strengths in the product association network. In case studies, we first implement the approach using an example with simulated choice set data. The quality of predicted choice sets is examined by assessing the estimation bias of the developed choice model. We then demonstrate the proposed approach using actual survey data of vehicle choice, illustrating the benefits of improving a choice model through choice set prediction and the potential of using such choice models to support engineering design decisions. This research also highlights the benefits and potentials of using network techniques for understanding customer preferences in product design. [DOI: 10.1115/1.4030160]

Keywords: choice set, choice modeling, customer preference, product association, data analytics, network analysis
¹ Corresponding author.
Contributed by the Design Automation Committee of ASME for publication in the JOURNAL OF MECHANICAL DESIGN. Manuscript received September 9, 2014; final manuscript received March 10, 2015; published online May 19, 2015. Assoc. Editor: Bernard Yannou.
Journal of Mechanical Design, JULY 2015, Vol. 137 / 071410-1. Copyright © 2015 by ASME.

1 Introduction

Customer choice modeling is gaining increasing attention in engineering design because it allows prediction of future product demand as a function of engineering design attributes and the target market description [1]. Recent efforts toward incorporating customer preferences into engineering design have addressed a wide spectrum of design interests such as platform-based product family design [2], hierarchical systems design [3], usage and social context based design [4,5], multilevel and multidisciplinary design [6,7], design under market competition [8], consider-then-choose models in design optimization [9], and robust design and sensitivity evaluation [10]. Although all of the aforementioned studies involve the modeling of customer choices, the practical implementation is challenged by the availability of choice set data [11,12].

A choice set, in the context of engineering design, is defined as “a set of product alternatives available to an individual who will seriously evaluate through comparisons before making a final choice.” The choice set information is often unavailable in the collected market data, but is rather important for building choice models. For example, for vehicle choice modeling, the National Household Travel Survey (NHTS) is a rich set of data that records vehicle ownerships, demographics of households, and detailed information on customer travel behaviors in the United States [13]; however, information on individuals’ vehicle choice sets was not collected. Existing approaches either use all vehicles in the market as the choice set or randomly pick a few vehicles² to form individual choice sets [16,17]. While both treatments may be reasonable under certain contexts [18], studies show that misspecification of choice sets may result in inferior choice model estimates, especially when a large set of choice alternatives exists [19,20]. By contrast, it has been shown that good predictions of reduced choice sets can better approximate individual choice behaviors and thus yield improved modeling results [21].

² Studies show that the number of vehicles a customer seriously considers is often in the range of 3-6 [14,15].

In this paper, we propose a data-driven network analysis approach to predict individual choice sets for choice data with missing choice set information. Given a customer’s profile and the final choice made, an individual’s choice set is predicted based on existing choice set data for a similar product. In the aforementioned vehicle example, choice set information gathered from a different source, e.g., J.D. Power vehicle survey (JDPA), can be used for learning and further predicting choice sets in order to use the NHTS data for choice modeling. The key idea of our approach is to construct a customer association network of products using existing choice set survey data. Network links and their strengths reflect the proximity or similarity of two products in customers’ perceptual space, indicating what product alternatives are more likely to be considered together by a customer. This line of reasoning is supported by rich literature in cognitive psychology [22,23] and is similar to the co-occurrence analysis in mining association rules [24,25]. Taking vehicles as an example, if the strength of the network link is larger between “Toyota Camry” and “Honda Civic” than that between Toyota Camry and “BMW 7,”

we can infer that a customer who is interested in Toyota Camry is more likely to consider Honda Civic than BMW 7 in forming a choice set.

For modeling the psychological process of choice set formation, various studies have been conducted in the literature in multiple domains such as marketing and transportation research [14,15,20,26–34]. Based on the two-stage noncompensatory–compensatory decision model [15,20,27], a customer is conceptualized as first filtering some irrelevant alternatives using relatively simple criteria (noncompensatory decision making) for determining the composition of a choice set, and then performing detailed trade-offs (compensatory decision making) to evaluate remaining options in a choice set. Examples of customers’ noncompensatory criteria include brand preferences, product awareness, budget constraints, product’s range of applicability, and environmental consciousness. Two distinctive classes of methods have evolved to address choice set formation or prediction. One class of methods explicitly generates a selected set of feasible alternatives for each customer from all possible alternatives using generation criteria such as screening based on critical attributes [29–32], explicit random constraints to determine the choice set probability [26,28], or hierarchical cluster analysis to identify relevant products [34]. The other class of methods simulates the noncompensatory effects implicitly through a compensatory utility function, such as the technique of the constrained multinomial logit model [33].

Different from these existing methods that model the choice set effect within the utility function or infer choice sets using a priori behavioral assumptions, our proposed approach is data-driven. We employ network analysis to mine product consideration associations based on existing choice set data collected from single or multiple sources and then predict individual-specific choice sets for new data. Our key assumptions are: (1) The application problem involves a large set of product alternatives and the prediction of small choice sets is needed; (2) The product alternatives in the existing and new datasets mostly overlap; (3) Customer social–demographical attributes are similar in both datasets. To account for customer heterogeneity in choice set formation, we divide a broad market into several distinct customer segments based on customer sociodemographical and usage characteristics. In prediction, the probability of selecting an alternative into a choice set integrates the information of product associations, customer segments, and the link strengths in a product association network. Integrating relational information from a product association network with heterogeneity considered by customer segmentations is a unique aspect of the proposed method.

The rest of the paper is organized as follows. First, technical backgrounds on discrete choice analysis and network modeling are provided in Sec. 2. Then, a novel data-driven network analysis approach for choice set mining and prediction is outlined in Sec. 3. For illustration and validation, a study using synthetic data is presented in Sec. 4, and is followed by an example using actual vehicle data in Sec. 5. Conclusions and methodological implications are summarized in Sec. 6.

2 Technical Background

2.1 Discrete Choice Analysis in Engineering Design. The core idea of discrete choice analysis in engineering design is to predict future product demand (choice probability) as a function of product design and customer attributes across a heterogeneous customer population. The choice utility $U_{nk}$ for a customer $n$ and a design alternative $k$ is formed by an observed component $W_{nk}$ and an unobserved random factor $\varepsilon_{nk}$. The deterministic part of utility $W_{nk}$ is expressed as a function of customer-desired product attributes $A_{nk}$ and customer profile attributes $S_n$, conditioned on the individual choice set $CS_n$ [1]

$$U_{nk} = W_{nk} + \varepsilon_{nk}, \quad \text{if } k \in CS_n \tag{1}$$

$$W_{nk} = f(\beta : A_{nk}, S_n), \quad \text{if } k \in CS_n \tag{2}$$

The customer-desired product attributes $A_{nk}$ refer to key product features that are important to a customer’s choice. The customer profile attributes $S_n$ include the socio-economic attributes (e.g., income), anthropometric variables (e.g., customer height), purchase history, and usage context attributes (e.g., use frequency) [1]. The model coefficients $\beta$ are statistically determined based on the observational choice data. It is noted that the estimated individual choice probability heavily depends on the choice set composition [12,28]. The decision of which alternatives to include in the choice set influences the overall log-likelihood function of the model as well as the estimated model coefficients.

2.2 Network Modeling of Association Rules. Network analysis has emerged as a key method for statistical analysis of complex relationships depicted in a simple network graph, where nodes represent individual entities and ties represent relationships between the entities [35]. Recent studies in product design have demonstrated the benefits of network analysis to characterize a complex product as a network of components that share technical interfaces or connections [36,37]. Association relations among products have been used to generate customer perceptual maps in marketing research [38] and improve the mining strategy of the transactional data in information system research [24,39]. Nevertheless, constructing product association networks based on consideration relations in choice sets is not well studied and serves as a unique contribution of this work for understanding customer choice behaviors.

Standard metrics of association rules, such as “lift” [25], can be used to quantify the strength of the connection between two products based on how often they appear in the same choice set. Lift between a product $i$ and a product $j$ is defined as [25]

$$\mathrm{lift}(i, j) = \frac{P(i, j)}{P(i)\,P(j)} \tag{3}$$

where $P(i)$ represents the probability of occurrence of product $i$ in a choice set, and $P(i, j)$ is the probability that both $i$ and $j$ appear in the same choice set. A lift value greater than 1 means $i$ and $j$ appear together more frequently than they would if the two products were independent, while a lift below 1 indicates negative dependence. Although the absolute value of lift has no precise statistical meaning, its tie to the chi-squared test indicates that a larger lift value corresponds to a stronger dependence [25]. Using the lift measure, there is no need to determine ad hoc thresholds for rule pruning and reduction, which could be a common issue when using other association measures. Moreover, the lift measure has no directional implication and treats the product associations uniformly, which is desirable for choice set formation.

3 A Data-Driven Network Analysis Approach for Choice Set Prediction

Two important aspects need to be taken into account in developing algorithms for choice set prediction. The first aspect involves the determination of the level of relevance of a product to an individual’s purchase decision. Customers are aware of many brands and models of products, but not all alternatives are seriously considered for purchase [14,20]. Second, a choice set is the result of individual- and context-specific judgments. The choice set can be very different even for two similar customers who face the same choice problem. Thus, a realistic model of choice set generation should be probabilistic [20]. We present a data-driven network analysis approach to address the above two aspects in choice set prediction. To address the first aspect of identifying relevant products in choice sets, a product association network is first constructed and product communities are identified. To handle the second aspect of heterogeneity and stochasticity,

customer segments are identified to further determine the relative consideration frequency that is used as the basis for predicting customer-specific product consideration probability by taking into account the link strengths between products. A graphical illustration of the major steps involved is shown in Fig. 1.

Fig. 1 Graphical illustration of the major steps

3.1 Mining Product Association Relations and Customer Clusters in Choice Set Data. Choice set data can be collected from various sources. Representative methods include using Internet clickstream data to capture the products viewed by a customer and the purchase choice made [40], designing surveys for collecting choice set information [14,27], and collecting customer scanner panel data or customer purchase history [29]. In this study, survey data of choice sets is analyzed to discover the associations of product alternatives, identify the strengths of associations, and detect product community groups as well as customer segments.

3.1.1 Construction of an Association Network of Product Alternatives. In modeling customers’ top-of-mind product associations, we construct a product association network to determine what products often appear together in customers’ considerations and how strong the associations are. Each node in the network represents a product alternative and a link exists between two product alternatives if they frequently co-occur in the same choice set. Based on the definition of lift in Sec. 2.2, we define the strength of a connection $L_{ij}$ between product $i$ and product $j$ as

$$L_{ij} = \begin{cases} \mathrm{lift}(i, j) & \text{if } \mathrm{lift}(i, j) > 1 \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

A link appears between two products if and only if the two products have a significant positive association indicated by the lift value. An illustrative network with 10 vehicles extracted from the complete network constructed using JDPA data is shown in Fig. 2. When a link exists, it implies the two linked product alternatives are very likely to be considered together by a customer. The strength value of the link indicates how often the vehicle pairs appear together in a choice set (e.g., “Volvo S60” has link strength 4.3718 with “Lincoln MKX”). The isolated node (“Land Rover Range Rover Sport”) indicates that this vehicle has no significant association relations to other vehicles in the network. It should be noted that in this work the association of products is analyzed based on their co-occurrence probability in customers’ consideration, as opposed to the similarity of products based on features.

3.1.2 Discovery of Product Communities. In network analysis, a community refers to a group of nodes with dense connections internally and sparser connections between the groups. Within a product association network, product communities are identified to group the strongly correlated products together. The resulting community structure can be used to gain insights into the market structure of products that customers follow in forming choice sets. The product communities (C) also provide starting points for choice set predictions; more details are provided in Sec. 3.2.

Different from classical clustering approaches that employ the distance or similarity matrix, network community detection requires the use of algorithms developed from graphical properties. In this research, Newman’s optimal modularity method [41] is employed to identify the product communities. The objective function defined using the modularity $Q$ is

$$Q = \frac{1}{2m} \sum_{ij} \left[ L_{ij} - \frac{d_i d_j}{2m} \right] \delta(C_i, C_j) \tag{5}$$

where $L_{ij}$ represents the strength of the link between node $i$ and node $j$, and $m = \frac{1}{2}\sum_{ij} L_{ij}$ is the total strength of links in the graph. The degree of node $i$ is denoted as $d_i$, and $d_i = \sum_j L_{ij}$. The $\delta$-function equals 1 if node $i$ and node $j$ belong to the same community, 0 otherwise. Modularity $Q$ measures the difference between the weighted links that fall within communities and the expected value of the same quantity in a random network with the same node degrees [41]. The larger the modularity value, the better the community partitioning is.

In Fig. 2, two distinct product communities identified using the modularity approach are illustrated. One community contains five small compact vehicles in green dots (e.g., Honda Civic, “Hyundai Sonata,” etc.) while the other community is essentially high-performance midsize vehicles in red dots (e.g., Volvo S60, Lincoln MKX, etc.). Further examination of similarities and differences of product features within each community will help better understand how different product attributes influence customers’ consideration decisions.

Fig. 2 Illustration of vehicle association network and product communities

3.1.3 Customer Segmentation. To capture the heterogeneity among customers in choice set formation, customer segmentation is used to group customers based on similarities they share with respect to customer profile attributes that are relevant to purchase needs and preferences. Depending on the market’s characteristics, customer segmentation is classified into observable and unobservable bases. The observable group (including geographic, demographic, and socio-economic variables) is the most popular source, as it is often readily available and provides a quick snapshot of a market. The unobservable group includes psychographics, personality, and buying behaviors, with the assumption that people’s activities, interests, and opinions are often decisive factors for their purchase decisions [42]. Customer clusters (G) can be identified by various statistical approaches, such as K-means clustering, latent class analysis, decision trees, hierarchical clustering, and ensemble models [43]. In this study, the K-means algorithm is used to identify customer segments (see the example in Sec. 5).

As a result of choice set data mining, product associations (including the link strengths (L) and communities (C)) and customer clusters (G) are identified. In Sec. 3.2, we show how these analytical results are integrated to predict relevant choice sets for individual customers.
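The link-strength rule of Eqs. (3) and (4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `lift_link_strengths` and the toy choice sets are hypothetical, and probabilities are assumed to be estimated as relative frequencies over the observed choice sets.

```python
from collections import Counter
from itertools import combinations


def lift_link_strengths(choice_sets):
    """Return {(i, j): L_ij} for product pairs whose lift exceeds 1 (Eq. (4)).

    Assumption: P(i) and P(i, j) are estimated as relative frequencies.
    """
    n_sets = len(choice_sets)
    single = Counter()  # number of choice sets containing product i
    pair = Counter()    # number of choice sets containing both i and j
    for choice_set in choice_sets:
        items = sorted(set(choice_set))
        single.update(items)
        pair.update(combinations(items, 2))

    links = {}
    for (i, j), n_ij in pair.items():
        # Eq. (3): lift(i, j) = P(i, j) / (P(i) P(j))
        lift = (n_ij / n_sets) / ((single[i] / n_sets) * (single[j] / n_sets))
        if lift > 1:  # Eq. (4): keep only positive associations
            links[(i, j)] = lift
    return links


# Hypothetical toy data; product names are reused from the text for readability only.
toy_sets = [
    ["Camry", "Civic", "Sonata"],
    ["Camry", "Civic"],
    ["S60", "MKX"],
    ["S60", "MKX", "Camry"],
]
links = lift_link_strengths(toy_sets)
for pair_key, strength in sorted(links.items()):
    print(pair_key, round(strength, 4))
```

Pairs with lift at or below 1 are simply absent from the returned dictionary, which corresponds to $L_{ij} = 0$ (no link) in the network.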
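To make the modularity objective of Eq. (5) concrete, the sketch below evaluates $Q$ for a given assignment of products to communities. It only scores a candidate partition; it does not perform the optimization itself, for which a dedicated community detection routine would normally be used. The function name `modularity` and the four-node example are hypothetical.

```python
def modularity(links, community):
    """Weighted modularity Q of Eq. (5) for a given node-to-community map.

    links: {(i, j): L_ij} with each undirected link listed once.
    community: {node: community label}.
    """
    degree = {node: 0.0 for node in community}
    for (i, j), weight in links.items():
        degree[i] += weight
        degree[j] += weight

    m = sum(links.values())  # equals (1/2) * sum_ij L_ij for a symmetric L
    if m == 0:
        return 0.0

    # Observed term: each within-community link appears twice in sum_ij.
    within = sum(2 * w for (i, j), w in links.items()
                 if community[i] == community[j])

    # Expected term: sum of d_i * d_j / (2m) over same-community pairs,
    # which factorizes into squared community degree totals.
    totals = {}
    for node, label in community.items():
        totals[label] = totals.get(label, 0.0) + degree[node]
    expected = sum(t * t for t in totals.values()) / (2 * m)

    return (within - expected) / (2 * m)


# Hypothetical example: two disconnected pairs of products.
example_links = {("A", "B"): 2.0, ("C", "D"): 2.0}
split = {"A": 0, "B": 0, "C": 1, "D": 1}
merged = {"A": 0, "B": 0, "C": 0, "D": 0}
q_split = modularity(example_links, split)
q_merged = modularity(example_links, merged)
print(q_split, q_merged)
```

Separating the two disconnected pairs scores higher than lumping all four products together, which matches the statement that a larger $Q$ indicates a better partition.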

3.2 A Data-Driven Strategy for Predicting Heterogeneous Choice Sets. Given the results from mining the existing choice set data, we propose a probabilistic sampling method to predict individual choice sets in new data where the information of customer attributes and their final choices are available but the choice set information is lacking.

Figure 3 shows an illustrative diagram of how different sources of information come together in the prediction process. Given the purchase choice $k$ for a customer $n$, we first identify which products are strongly associated with product $k$ based on the product consideration association network identified using the existing data. The candidate alternatives in a choice set will be sampled from the same product community (noted as $C_k$) as the purchase choice $k$. Furthermore, products directly linked to the purchase choice $k$ with higher link strengths (i.e., higher $L$) than the other products in the community are more likely to be considered. For example, suppose the purchase choice vehicle $k$ is Hyundai Sonata; a community of products containing Hyundai Sonata can be identified from the vehicle association network, as shown in the upper left corner of Fig. 3. Among these products, Honda Civic and Toyota Camry are more likely to be considered than other vehicles, as the two vehicles are directly linked to Hyundai Sonata in the network. Moreover, Toyota Camry has a higher chance to be considered than Honda Civic due to its higher link strength (21.337 versus 3.4333). Following the same notation defined in Sec. 3.1, we represent the link strength between the purchase choice $k$ and any other product $k'$ in the association network as $L_{kk'}$.

Predicting choice sets based on product associations and link strengths alone does not take into account customer heterogeneity. For example, middle-aged, high-income people are more likely to consider hybrid electric vehicles (HEV) than other groups of customers; however, this information is missing from the aggregated product association network. We overcome this limitation by relating product community $C_k$ to the customer cluster $G_n$ to which a customer $n$ belongs, through the evaluation of relative consideration frequency (noted as $F_{G_n}(k')$) for a product $k'$ in community $C_k$. As shown in Eq. (6), based on the existing choice set data, $N_{k'G_n}$ is the total number of times that the group of customers in $G_n$ considers product $k'$ in their choice sets. The summation in the denominator indicates the total number of times that all products (except for the choice $k$) in community $C_k$ are considered by a group of customers $G_n$. The ratio in Eq. (6) reflects the relative consideration frequency of a product $k'$ among a community of products $C_k$ for a specific group of customers $G_n$. In the preceding example, when the customer’s final choice is Sonata, the relative consideration frequency $F_{G_n}(k')$ is evaluated for the other four vehicles in the community (Camry, Caliber, Civic, and Mazda) in association with customer cluster $G_n$, as shown in the pie chart of Fig. 3.

$$F_{G_n}(k') = \frac{N_{k'G_n}}{\sum_{k'=1}^{|C_k|-1} N_{k'G_n}}, \quad \text{for } \forall k' \neq k \text{ and } k' \in C_k \tag{6}$$

As the relative product consideration frequency defined above does not account for product co-occurrence within a community, a metric called co-consideration frequency $CF_{nk}(k')$ of a product $k'$ is proposed to take into account the link strength $L_{kk'}$ in a product community

$$CF_{nk}(k') = \begin{cases} F_{G_n}(k'), & \text{if } L_{kk'} = 0 \\ F_{G_n}(k') \cdot L_{kk'}, & \text{otherwise} \end{cases} \quad \text{for } k' \neq k \text{ and } k' \in C_k \tag{7}$$

Equation (7) shows that when the product $k'$ is not directly linked to the purchase choice $k$ in the network (i.e., $L_{kk'} = 0$), the estimated co-consideration frequency remains the same as the relative consideration frequency shown in Eq. (6). When an association link exists between a considered product $k'$ and the purchase choice $k$ (i.e., $L_{kk'} > 1$), the estimated co-consideration frequency is increased by multiplying it with the link strength.

Our prediction approach follows a sequential sampling procedure in generating choice sets. The sampling probability for each considered product $k'$ is referred to as customer-specific product consideration probability $CP_{nk}(k')$, which is a normalized quantity of the estimated co-consideration frequency $CF_{nk}(k')$. The sum of $CP_{nk}(k')$ over all possible products $k'$ equals 1.

$$CP_{nk}(k') = \frac{CF_{nk}(k')}{\sum_{k'} CF_{nk}(k')}, \quad \text{for } \forall k' \neq k \text{ and } k' \in C_k \tag{8}$$

The above equations do not address the situation where the purchase choice $k$ is an isolated node (product) in the network. In this case, we expand the product community $C_k$ in Eqs. (6)–(8) to the whole product network including all available product alternatives. As shown in Eq. (9), the consideration probability in this case is simply the pure relative consideration frequency of a product $k'$ considered by customer cluster $G_n$.

$$CP_{nk}(k') = CF_{nk}(k') = F_{G_n}(k'), \quad \text{for } \forall k' \neq k \tag{9}$$

Equations (6)–(9) have several implications. First, except for an isolated choice $k$, sampling is restricted to a community of products that are directly or indirectly linked to the purchase choice $k$ in the product community network. Products outside the community network $C_k$ are not considered in sampling. Second, the distribution of the consideration probability $CP_{nk}$ depends on both the product association community $C_k$ and the customer cluster $G_n$ through the evaluation of the consideration frequency $F_{G_n}(k')$ as well as the co-consideration frequency $CF_{nk}(k')$. Thus, the method takes into account the systematic heterogeneity in customer preference. Since the sampling follows probabilistic distributions, even if two customers share similar profiles and make the same purchase choice, the predicted choice sets can still be different, taking into account the random heterogeneity in customer preferences.

To ensure the sampled product alternatives are mutually exclusive in an individual choice set, the sampling procedure draws one product after another, without repetition, until the choice set reaches the desired size. When the size is unknown, the average choice set size in the training data is used as a guideline. The whole prediction process is operated sequentially to determine the individual choice sets from one customer to another.
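The probability calculation in Eqs. (6)–(8) and the one-at-a-time sampling described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the consideration counts are invented placeholder values, and only the two link strengths (21.337 and 3.4333) come from the example in the text. The sketch returns the sampled alternatives only and leaves open whether the purchased product itself counts toward the choice set size.

```python
import random


def consideration_probabilities(purchase, community, counts, links):
    """CP_nk(k') for all k' != k in C_k, following Eqs. (6)-(8).

    counts: {product: N_k'Gn} for the customer's cluster G_n.
    links: {frozenset({k, k'}): L_kk'}; a missing key means no link (L = 0).
    """
    candidates = [p for p in community if p != purchase]
    total = sum(counts.get(p, 0) for p in candidates)
    if total == 0:
        return {}
    cf = {}
    for p in candidates:
        freq = counts.get(p, 0) / total                           # Eq. (6)
        strength = links.get(frozenset((purchase, p)), 0.0)
        cf[p] = freq if strength == 0 else freq * strength        # Eq. (7)
    norm = sum(cf.values())
    return {p: value / norm for p, value in cf.items()}           # Eq. (8)


def sample_choice_set(probs, n_alternatives, rng):
    """Draw products one at a time, without repetition."""
    remaining = dict(probs)
    sampled = []
    while len(sampled) < n_alternatives and sum(remaining.values()) > 0:
        products = list(remaining)
        weights = [remaining[p] for p in products]
        pick = rng.choices(products, weights=weights, k=1)[0]
        sampled.append(pick)
        del remaining[pick]  # choices() rescales the remaining weights implicitly
    return sampled


community = ["Sonata", "Camry", "Civic", "Caliber", "Mazda"]
counts = {"Camry": 10, "Civic": 10, "Caliber": 10, "Mazda": 10}  # hypothetical counts
links = {
    frozenset(("Sonata", "Camry")): 21.337,   # link strengths quoted in the text
    frozenset(("Sonata", "Civic")): 3.4333,
}
probs = consideration_probabilities("Sonata", community, counts, links)
predicted = sample_choice_set(probs, 3, random.Random(0))
print(probs)
print(predicted)
```

For an isolated purchase choice, passing every available product as `community` with an empty `links` dictionary reduces the result to the relative consideration frequency, which is the case covered by Eq. (9).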
Fig. 3 Illustration of the choice set prediction approach

As introduced in Sec. 1, our approach assumes that the mining choice set data and the prediction data share very similar product alternatives and customer demographics $S$. When the prediction data only contains a subset of alternatives that belong to the mining choice set data, the group consideration frequency $N_{k'G_n}$ for the irrelevant products should be set to zero so that they are not sampled in prediction. When a new product appears in the new data but is not found in the existing choice set data, the new product will be assumed to be independent from other products and treated as an isolated node in the association network. Information on the current market share of the new product will be used for describing the sampling distribution. Specifically, one can use the choice frequency in the prediction data to approximate $N_{k'G_n}$ in Eq. (6) and compute $CP_{nk}$ in Eq. (9). An illustrative example using two different sets of market data is provided in Sec. 5.4. Similarly, for clustering customers based on demographic attributes $S$, if attributes in the two datasets do not completely overlap, only the shared attributes will be used in clustering. In the extreme case where no customer information is available, one can still implement the approach by treating all customers as a homogeneous group, as shown by our simulation study in Sec. 4.

4 Case Study of Synthetic Data

The goal of our synthetic data study is to evaluate the improvement of choice modeling using the proposed approach for choice set prediction. Using the synthetic choice set data and the predefined choice model parameters, we are able to measure the estimation bias of the model parameters as a result of choice set misspecifications under different noise scenarios.

4.1 Generation of Choice Set Data. We generate a population of 10,000 customers who use different consideration rules over a total of 100 product alternatives. Product alternatives are described by four explanatory variables ($A_1$–$A_4$), three of which ($A_1$, $A_2$, and $A_3$) are continuous variables while the remaining one ($A_4$) is a binary variable. Literature shows that customers often develop heuristic decision rules to select products in forming their choice sets [27]. Five hypothetical decision rules covering all four product attributes are created for choice set formation. Using vehicles as an example, rules can be associated with the price and fuel economy. As shown in Table 1, we define Rule 1 and Rule 3 by setting lower thresholds on attributes $A_1$ and $A_3$. Rule 2 and Rule 4 are associated with upper thresholds on $A_2$ and $A_3$. The heterogeneity of a decision rule among customers is modeled by the threshold value, which is randomly sampled from a bounded uniform distribution whose range is defined by the attribute level of existing products. For example, the bounded uniform distribution for threshold2H in Rule 2 guarantees that the sampled threshold value is higher than at least 40% of low-end products ($\Phi^{-1}(0.4)$) and lower than at least 10% of available products ($\Phi^{-1}(0.9)$) in the data, where $\Phi^{-1}(\cdot)$ stands for the inverse of the standardized normal distribution function. In generating the synthetic data, we also take into account the fact that not all customers will consider all rules. The fraction of customers who consider a particular rule is shown in Table 1. A customer may consider all five rules or none of the rules listed.

Table 1 Consideration rules applied to generate true individual choice sets in simulation

Rule                          Heterogeneity                                                  Customers (%)
Rules for continuous attributes
Rule 1: $A_1$ > threshold1L   threshold1L ~ Unif[$\Phi^{-1}(0.2)$, $\Phi^{-1}(0.7)$]         60
Rule 2: $A_2$ < threshold2H   threshold2H ~ Unif[$\Phi^{-1}(0.4)$, $\Phi^{-1}(0.9)$]         80
Rule 3: $A_3$ > threshold3L   threshold3L ~ Unif[$\Phi^{-1}(0.1)$, $\Phi^{-1}(0.4)$]         20
Rule 4: $A_3$ < threshold3H   threshold3H ~ Unif[$\Phi^{-1}(0.6)$, $\Phi^{-1}(0.9)$]         20
Rules for binary attributes
Rule 5a: $A_4$ = 1            -                                                              20
Rule 5b: $A_4$ = 0            -                                                              20

4.2 Choice Model Specification and Choice Response Simulation. Given the simulated “true” choice sets described above, the purchase decision is made based on the random utility as defined in Eq. (1). The prespecified utility function for this case study is shown in Eq. (10), where the systematic component $W_{nk}$ is a linear sum of four product attributes $A_1$–$A_4$. The error component $\varepsilon_{nk}$ for all $k \in CS_n$ is randomly drawn from a Gumbel distribution with a variance of $\pi^2/6$

$$W_{nk} = \beta_1 A_{k1} + \beta_2 A_{k2} + \beta_3 A_{k3} + \beta_4 A_{k4}, \quad \text{if } k \in CS_n \tag{10}$$

We study the effect of two levels of model noise (error component) on the performance of the proposed approach. Following the method in Ref. [44], two scenarios are examined by adjusting the model coefficients in the utility function. In the “high noise” scenario, the coefficients on the explanatory variables are assumed to be (1, 1, 1, 1). Holding the distribution of the random component constant, a “low noise” scenario is created by doubling the model coefficients to (2, 2, 2, 2), which decreases the relative contribution of the random component to the overall utility. Customer attributes $S_n$ are omitted in the utility function for simplicity. The choice response for each individual customer is determined by choosing the product alternative that maximizes the individual utility $U_{nk}$.

4.3 Mining the Simulated Choice Sets Data. Using the simulated choice set and choice response data, a training set and a testing set are formed: the first half (5000 choice observations), together with the corresponding true choice sets, is used to learn product associations and customer consideration frequencies. Predictions of choice sets are carried out for the second half (5000 choice observations), where the choice sets are assumed to be unknown. Using the lift association analysis introduced in Sec. 2.2, 1626 pairs of relations are identified, among which 285 (17.53%) links with a value below 1 are removed from the network to reduce the noise embedded in the data. To identify product communities, Newman’s optimal modularity method [41] is employed and solved using a greedy algorithm [51] in the R environment with the igraph package [52]. The algorithm identifies three product communities and one isolated node within the 60 considered product alternatives. Since customer attributes $S_n$ are not considered in this choice model, all customers are treated identically and combined into one single cluster.

4.4 Prediction of Choice Sets for Simulated Data. Following the approach described in Sec. 3.2, the choice set sampling process proceeds by computing the customer-specific product consideration probabilities $CP_{nk}(k')$ in Eqs. (6)–(9). Once the sampling distribution is determined, alternative products are drawn sequentially to generate the predicted choice sets. The number of sampled products for each choice set is fixed at 12, which is determined by the average choice set size from the training data. The process is repeated for every customer in the test data.
The simulated data have an average choice set size of 13, among 4.5 Estimation of Parameter Bias for Choice Models. To
the 60 out of 100 alternatives being considered one or more times. assess the effectiveness of the predicted choice sets for improving

choice modeling, we use the mean absolute percentage error (MAPE) to compute the bias of the estimated model parameters, as in Ref. [45]

$$\mathrm{MAPE} = \frac{100\%}{R} \sum_{r=1}^{R} \left| \frac{\hat{\beta}_r - \beta_r}{\beta_r} \right| \qquad (11)$$

The MAPE metric assesses how good (on average) the estimated model parameter $\hat{\beta}_r$ is relative to the target true parameter $\beta_r$, averaged over the total number of model parameters ($R = 4$ in the example). The $\hat{\beta}_r$ is determined by the maximum likelihood criterion using the 5000 test observations together with the specified or predicted choice sets. With all other factors being controlled, the discrepancies in parameter estimates are only associated with the specification error of choice sets. Although MAPE does not take into account the standard error of the parameter estimates, we note that in this particular case all parameter estimates are statistically significant and the impact of standard error is trivial.

The MAPE is computed for four multinomial logit choice models with different choice set structures for comparison, including (1) the true choice set model; (2) the universal choice set model [20], which utilizes the full set of 60 alternatives; (3) the random choice set model, which samples the alternatives from a uniform distribution; and (4) the predicted choice set model using the proposed approach. To account for the sampling uncertainty in generating the choice sets for Model 3 and Model 4, the $\hat{\beta}_r$ for Models 3 and 4 are computed as the sample average of 100 experiments, each consisting of one choice set generation and one model estimation. The sample variances for all $\hat{\beta}_r$ are found to be small ($10^{-4}$–$10^{-5}$) and thus are not reported. For a fair comparison, the choice set size for these two models is set equal to the average true choice set size.

As shown in Fig. 4, Model 1 with the true choice sets gives the smallest MAPE, meaning that the estimated parameters are the “closest” to the true parameters among the four choice models. Of the other three models, Model 4 with the predicted choice sets has a smaller MAPE, implying reduced bias and better quality of its parameters. It is also observed that the size of the error component has an impact on the performance of the proposed approach. In the high noise scenario where the unobserved uncertainty is relatively large, the bias reduction is less significant (from 27.2% to 15.5%). In the low noise scenario where the random component is small relative to the systematic component, the bias reduction is more considerable (from 25.6% to 10.8%). This suggests that the proposed approach is more capable of generating consistent model estimates when the uncertainty of choice utility is reduced.

Although parameter error is not the exact model error, values of the estimated parameters inform the product utility, and thus directly influence customers' purchase decisions afterwards. In this synthetic example, the customer heterogeneity is not explicitly modeled even though it can be considered using the mixed logit technique by modeling $\beta_r$ as random parameters. Our mixed logit test shows that the concluded benefit of using the proposed choice set generation approach for reducing the parameter bias in model estimation is still valid. In the case study of the vehicle choice problem presented next, customer heterogeneity is considered by introducing the customer profile attributes as model inputs.

5 Case Study of Vehicle Choice

In the second case study, we apply the proposed approach to two sets of actual vehicle data for choice modeling. The primary data source is the 2007 Vehicle Quality Survey from JDPA. The choice set information is obtained by asking customers to report exactly three other vehicles that they seriously considered, in addition to the final purchase choice. The dataset used for this study contains 7279 nation-wide respondents who purchased over 262 different vehicles. Data of customer profile attributes S, including the sociodemographics and the usage context attributes (e.g., local/highway usage, miles driven daily), are collected at the individual level.

For the purpose of validation, we divide the JDPA survey data into 70% for training and 30% for validation. The training set is used to identify the vehicle association network and customer clusters, while the validation set is used to examine the prediction quality of choice sets. In addition to the JDPA choice set data, the NHTS is used as the second data source for predicting choice sets and fitting choice models. The data presented in the case study contain a sample population of 13,802 respondents from California with vehicles purchased in brand-new condition during 2002–2009. NHTS data include a similar set of information as JDPA data on customer demographics, vehicle ownership, and vehicle usage, but the choice set information on which vehicle(s) a respondent considered before purchasing was not collected. Among the 217 vehicle models involved in the NHTS data, 200 vehicle models are mentioned at least once by respondents in the JDPA survey. Since most of the vehicle alternatives are the same in both datasets, the JDPA data are an ideal source for learning and predicting choice sets for NHTS data.

5.1 Network Association Analysis

5.1.1 Network Visualization and Graphical Analysis of the Vehicle Data Sets. We begin our discussion by showing the topological properties of the constructed product association network generated with the training data (JDPA data). Examining the characteristics of this network helps us understand the nature of associations as well as the market structure. In the lift analysis, we find 16,166 pairs of relations, meaning that 47.3% of the pairs of vehicle models are co-considered at least once in the data. Despite the large number of vehicles involved, only 11,006 (15.1%) of the identified relations are significant with a lift value greater than 1. As such, the actual vehicle data reveal more randomness in terms of the association relations compared to the synthetic dataset examined in Sec. 4. Figure 5(a) shows the histogram of the pruned link strength along with a fitted power-law curve. The heavy-tailed distribution suggests that only a few pairs of products are strongly associated while the association rules for most of the vehicle alternatives are not highly evident.
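The lift analysis and pruning step described above can be sketched as follows. This is a minimal illustration, assuming the standard definition of lift, $P(i, j) / (P(i)\,P(j))$, with each probability estimated as the share of choice sets that contain the product(s); the paper's own definition is the one given in Sec. 2.2. The function names `lift_links` and `prune` and the toy choice sets are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def lift_links(choice_sets):
    """Return {(i, j): lift} for every pair of products co-considered at least once.

    Assumption: lift = P(i, j) / (P(i) * P(j)), with probabilities estimated
    as the share of choice sets containing the product(s).
    """
    n = len(choice_sets)
    single = Counter()  # number of choice sets containing each product
    pair = Counter()    # number of choice sets containing each product pair
    for cs in choice_sets:
        items = sorted(set(cs))
        single.update(items)
        pair.update(combinations(items, 2))
    return {
        (i, j): (n_ij / n) / ((single[i] / n) * (single[j] / n))
        for (i, j), n_ij in pair.items()
    }


def prune(links, threshold=1.0):
    """Keep only links whose lift is greater than the threshold."""
    return {edge: lift for edge, lift in links.items() if lift > threshold}


# Hypothetical toy data: four customers, four products.
toy_sets = [["A", "B"], ["A", "B"], ["C", "D"], ["A", "C"]]
all_links = lift_links(toy_sets)
kept = prune(all_links)
```

In this toy example three pairs are co-considered, and the pair (A, C) is pruned because the two products appear together less often than independence would predict.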
Fig. 4 Bias quantification of estimated choice models under two noise scenarios

In the network analysis, the degree of a node is defined as the number of links it has to other nodes. Conceptually, it can be interpreted as a measure of importance of the node in a network, also known as the degree centrality [35]. In the vehicle association network, the existence of products with a high degree centrality means that they are often considered with a broad range of vehicles in a choice set. Figure 5(b) summarizes the degree distribution of the product association network. This network exhibits the scale-free property that nodes are not uniformly linked but are congregated around a few “centered” products. This suggests that
the average product is considered infrequently with the majority of its neighbors, and frequently with only a few in the market. To better understand the structural implications in a product association network, we compared the products with high centralities to the products with high occurrence frequency in choice sets. The overlap rate of the top-30 vehicles in terms of node degree (centrality) and occurrence frequency is 50.0%. This indicates that the products that are widely considered are not necessarily the most popular alternatives in choice sets. The most compared-against vehicle model is “Nissan Altima,” which has been compared at least once to 117 other vehicle models. “Toyota Camry” and “Honda Accord” are the top two vehicle models in terms of the frequency of occurrence in choice sets. “Toyota FJ Cruiser” and “Toyota Avalon” are two of the top-10 vehicle models in degree centrality, but they are not among the top-30 most considered vehicles based on the training set of the JDPA data.

Fig. 5 Frequency distribution of network statistics: (a) link strength and (b) node degree

Figure 6(a) provides a visualization of the processed network, where different colors stand for the vehicle communities identified. The layout of the network is optimized using the Fruchterman–Reingold algorithm [46], where nodes are positioned in an aesthetically pleasing way, but not inversely proportional to the strength of the links. The network comprises a giant interconnected component and 17 isolated nodes. The presence of isolated vehicle alternatives is caused by their low frequency of occurrence in choice sets and their random association pattern to other vehicles. As noted, this vehicle association network has a remarkably low density (0.075), while other network statistics are comparable to a typical social network of the same size [47].

The above analysis presents the advantages of using network techniques to simultaneously measure the association of 262 car models based on the customer data. Beyond traditional association analysis, the network technique allows us to analyze and visualize a large number of product alternatives, and thereby provides us with a better visual understanding of the positions and roles of various products in a market. Furthermore, commonalities and differences of the product features within and across product communities can be further examined to draw insights into the noncompensatory criteria customers use and to guide product designs.

5.1.2 Vehicle Community Structure in a Weighted Network. To further explore the vehicle association structures, we inspect the communities of vehicles that are considered frequently together within a group but less frequently with other groups of vehicle models. Six communities are identified within the large interconnected component, as shown in Fig. 6(a) with different colors. The community size varies from 10 vehicles (3.8% of total vehicles) to 78 vehicles (29.8% of total vehicles). The types of vehicles falling into each identified community are listed in Table 2. For example, community C-1 contains 56 vehicles among which 39 are passenger cars (CAR) and 17 are sport utility vehicles (SUV). By contrast, multiactivity vehicles (MAV) dominate community C-2 over other types of vehicles. This implies that the grouped vehicle communities show a highly organized structure in terms of vehicle types, but sometimes customers do consider different types of vehicles together. Similar analysis can be performed on other vehicle features to study the importance of vehicle features in the customer's consideration process.
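Sec. 4.3 cites Newman's modularity method for identifying product communities; assuming the same criterion underlies the vehicle communities here, the quantity being maximized can be sketched as below. The code only scores a given partition of a weighted, undirected network with the standard modularity formula; it does not perform the greedy search that the paper runs in R with igraph. The function name `modularity` and the toy network are hypothetical.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Newman's modularity Q of a partition of a weighted, undirected graph.

    edges: {(i, j): weight}, each undirected link listed once, no self-loops.
    membership: {node: community label}.
    Uses Q = sum_c [ w_in(c) / m - (s(c) / (2 m))**2 ], where m is the total
    link weight, w_in(c) the link weight inside community c, and s(c) the
    summed node strength (weighted degree) of community c.
    """
    m = sum(edges.values())
    w_in = defaultdict(float)
    strength = defaultdict(float)
    for (i, j), w in edges.items():
        strength[membership[i]] += w
        strength[membership[j]] += w
        if membership[i] == membership[j]:
            w_in[membership[i]] += w
    return sum(w_in[c] / m - (strength[c] / (2 * m)) ** 2 for c in strength)


# Hypothetical toy network: two tightly linked triangles joined by one weak link.
toy_edges = {
    ("a", "b"): 2.0, ("b", "c"): 2.0, ("a", "c"): 2.0,
    ("d", "e"): 2.0, ("e", "f"): 2.0, ("d", "f"): 2.0,
    ("c", "d"): 1.0,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
q_split = modularity(toy_edges, split)    # two communities
q_merged = modularity(toy_edges, merged)  # everything in one community
```

Splitting the toy network into its two triangles scores higher than lumping all nodes together, which is the behavior a modularity-based search exploits when grouping products that are considered together more often than expected.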
Fig. 6 Vehicle association network with colored communities and structural properties: (a) network derived using training choice sets and (b) network derived using predicted choice sets

5.1.3 Customer Segments. In customer clustering, seven customer profile attributes S, including five sociodemographic attributes and two usage context attributes, are extracted for each customer from the training data. Note that “Gender” is coded as 1 for male and 2 for female, and all other attributes are coded by their numeric levels. All of the variables are standardized (centered to zero and scaled by dividing by their standard deviation) before clustering. The K-means algorithm [48], a basic unsupervised clustering method, is used for customer clustering. Eight clusters are identified, which achieves a good balance between the classification quality and the model efficiency.

Table 3 shows the cluster size and the cluster mean values of the standardized customer profile variables S. The identified clustering pattern shows both the customer heterogeneity and the correlation between different profile attributes. For example, customers in cluster G-5 are older people with small family size, while customers in cluster G-6 tend to have higher income and higher education level. To extract more useful patterns from the customer cluster data, other statistical techniques, such as post-clustering methods and discriminant analysis techniques, can be performed to identify key customer characteristics that differentiate the clusters.

Table 2 Vehicle choice set community

Choice set community no.   Size   % of total vehicles   CAR   MAV   MINI CAR   MINI-VAN   PICKUP   SUV
C-1                          56          21.4            39     0       0          0         0      17
C-2                          78          29.8             7    69       0          0         2       0
C-3                          48          18.3            30    11       0          0         0       7
C-4                          42          16.0            22     3       9          1         0       7
C-5                          10           3.8             0     0       0         10         0       0
C-6                          11           4.2             0     0       0          0        11       0
Isolates                     17           6.5             5     5       0          1         4       2
Total                       262                         103    88       9         12        17      33
As an example of how customer heterogeneity is considered in choice set prediction, we present in Fig. 7 the relative consideration frequency ($CF_{nk}$) and the consideration probability ($CP_{nk}$) for two customers who purchased a Hyundai Entourage (in community C-5) but belong to different customer clusters. Community C-5 is characterized by midsize minivans with prices ranging from $21k to $31k. As indicated by CF, customers characterized by high income and high education (G-6) consider high-class expensive minivans more frequently than other cars, e.g., Car 6 (Honda Odyssey), Car 8 (Nissan Quest), and Car 9 (Toyota Sienna). Customers with smaller family size (G-5) prefer cheaper minivans and have lower expectations in size, e.g., Car 2 (Chrysler Town & Country) and Car 4 (Dodge Grand Caravan). Taking into account the co-occurrence probability of two vehicles in a choice set, the consideration probability CP is developed by modifying CF using the association strength between vehicles (Eqs. (7) and (8)). As noted, even though Car 4 (Dodge Grand Caravan) is frequently considered by customers in G-5, its low co-occurrence with the purchase choice Hyundai Entourage leads to a low consideration probability CP for sampling in prediction.

Fig. 7 Examples of customer heterogeneity in choice set preference. CF and CP for customers who purchased the same Hyundai Entourage but belong to different clusters.

Table 3 Customer segments generated by K-means clustering based on S attributes

Customer cluster no.   Size   S_gender   S_age   S_income   S_children #   S_education   S_local/hwy usage   S_miles driven daily
G-1                     716    0.781     1.001    0.662        0.328          0.306            0.056                0.051
G-2                     735    1.280     0.601    0.023        0.543          0.215            0.116                0.370
G-3                     458    1.280     0.421    0.458        1.357          0.037            0.241                0.036
G-4                     671    0.781     0.186    0.655        1.559          0.221            0.048                0.075
G-5                     693    0.775     1.118    0.454        0.551          0.759            0.040                0.307
G-6                     852    0.757     0.650    0.739        0.466          0.890            0.053                0.287
G-7                     668    1.280     1.107    0.711        0.383          0.016            0.249                0.028
G-8                     302    0.385     0.012    0.006        0.006          0.032            0.577                2.584

5.2 Predicting Individual Choice Sets Within J.D. Power Data. For measuring the prediction quality of choice sets, we use a hold-out sample in the JDPA data to implement the prediction strategy. The choice set for each respondent n is predicted given the information of customer profile $S_n$ and the final purchase choice k. Following the method presented in Sec. 3.2, the sampling process is performed based on the customer-specific product consideration probabilities defined in Eqs. (6)–(9). An example of how a choice set is generated is given as follows. Take a senior vehicle buyer who belongs to customer cluster G-5 as an example. Given the fact that he selected a “Honda CR-V” in his final purchase, it is detected from the established association network that the Honda CR-V falls into the vehicle community C-2 with 78 vehicles in total. This means that the sampled vehicles will be restricted to the ones in that community. Following the sampling distribution constituted by the consideration probabilities in Eq. (8), which also takes into account the link strength, three alternatives are sequentially drawn probabilistically (e.g., “Nissan Xterra,” “Toyota Highlander,” and “Toyota RAV 4”) in addition to the chosen vehicle (Honda CR-V) to form the individual choice set.

5.3 Verification and Validation of the Predicted Choice Sets. For this case study, we validate the proposed approach using both in-sample (Sec. 5.3.1) and out-of-sample tests (Secs. 5.3.2 and 5.3.3). First, we examine the similarity of product association rules between the training choice sets and the predicted choice sets by observing their respective network topological properties. We then present an accuracy measure at the aggregate level to compare two matrices formed by the predicted choice sets and the actual choice sets. Finally, we provide the hit-rate information at the individual level between the predicted choice sets and the actual choice sets.

5.3.1 Verification of Association Rules in the Predicted Product Association Network. The product association network derived using the predicted choice sets and its topological properties are shown in Fig. 6(b). Compared to the association network based on the training data in Fig. 6(a), the color profiles of the two networks reveal remarkable similarities, implying that the community compositions of the two networks are nearly identical. Statistics suggest that 91.6% of vehicle nodes have the same community membership in the two networks, while most of the nodes with mismatched colors are isolated nodes in the training network. The differences between the two association networks can be explained by the random treatment of isolated nodes in the prediction strategy as a result of missing information on association rules.

5.3.2 Aggregated Precision of the Predicted Choice Sets. To evaluate the effectiveness of the predicted choice sets in matching
the true choice sets data at the aggregated level, we establish two matrices formed by the predicted choice sets and the actual choice sets, respectively. With dimension equal to the number of product alternatives (262 × 262), each matrix cell records the co-occurrence frequency between a pair of products. We employ the quadratic assignment procedure (QAP) [49] to test whether the correlation between the two matrices is statistically significant. As a nonparametric technique, QAP does not rely on the assumption of independence and is especially useful when the observations being analyzed are not independent of one another [49]. We note from our testing result that there is an observed positive correlation of 0.6080 between the predicted choice set matrix and the actual choice set matrix. If we pair the product alternatives randomly for the predicted choice set matrix, the average value of the correlation is only 0.0002 over 5000 iterations of QAP permutation. This result suggests that there is a strong correlation between the matrix formed by the predicted choice sets and the one formed by the actual choice sets, which means the two matrices are consistent with each other. The fact that the correlation measure for the predicted choice sets differs significantly from a random result further confirms the effectiveness of the proposed generation strategy.

5.3.3 Individual Precision of the Predicted Choice Sets. At the individual customer level, we calculate the hit-rate metric as the ratio of the hit size (total number of products predicted correctly) to the prediction size (total number of products for prediction). Using the validation set of the JDPA data, the average hit rate for the predicted choice sets is 13.05%, while the same metric for random sampling is only 1.10%. Specifically, for the predicted choice sets, only 0.1% of customers have all three vehicles correctly predicted; nevertheless, this share rises rapidly to 24.6% when at least one of the three vehicles is required to be correctly predicted. These figures indicate a significant improvement, while a perfect match of individual choice sets is not expected due to the stochastic nature of customer behavior [50]. In this study, the hit-rate accuracy is affected by the limited customer profile attributes. We expect the hit-rate accuracy will further increase when more relevant customer attributes are collected, as richer customer information can help better characterize and predict customer heterogeneity in the consideration behavior.

5.4 Choice Modeling With the Predicted Choice Sets. Using the proposed sampling strategy in Sec. 3.2, the mined results from JDPA data are used to predict choice sets for the NHTS data where the choice set is unknown. Three vehicle alternatives are sampled for each individual choice set, in addition to the purchase choice itself.

For discrete choice analysis, we compare three multinomial logit (MNL) models using choice sets generated by different methods, including the one generated by the proposed approach. The idea is to study the effect of choice set composition on the choice model estimates and the overall goodness-of-fit. In each model, the specific structure of the individual choice utility follows Eq. (12). Alternative specific constants for each vehicle are omitted to allow for choice prediction of new vehicles in vehicle design

$$W_{nk} = \beta_A A_k + \beta_{AS} A_k S_n, \quad \text{if } k \in CS_n \qquad (12)$$

The customer profile attributes S included in model estimation are household income, number of children under 18, and fuel price at the vehicle purchase year. The customer-desired product attributes A are selected to include vehicle price, vehicle origin, vehicle type, HEV indicator, footprint, miles per gallon (MPG), and horsepower. Vehicle origin and vehicle type are coded as a sequence of dummy variables to classify a vehicle into certain
categories. The vehicle dimension is represented through the footprint variable, which is essentially the product of vehicle length and vehicle width. MPG and horsepower reflect the fuel economy and peak performance of a vehicle, respectively. All models are described using the same set of explanatory variables. The estimated coefficients and their standard errors (in parentheses) are reported in Table 4.

Table 4 Estimates from multiple MNL models with different choice set specifications using validation data

                                     Model 1             Model 2                  Model 3
                                     MNL w/ universal    MNL w/ randomly          MNL w/ predicted choice set
                                     choice sets         sampled choice sets      using the proposed method
Choice set size                      217                 4                        4
No. of obs.                          2,995,034           55,208                   55,208
Log-likelihood (null)                -4.53 × 10^7        -1.17 × 10^7             -1.17 × 10^7
Log-likelihood (model)               -4.23 × 10^7        -1.03 × 10^7             -9.88 × 10^6
McFadden's pseudo R^2                0.0664              0.1184                   0.1532
Attributes
A_price / S_income                   0.1921 (0.0004)     0.1998 (0.0005)          0.1183 (0.0007)
Vehicle origin (domestic as base)
A_European                           0.2396 (0.0016)     0.0584 (0.0018)          0.0952 (0.0020)
A_Japanese                           0.6510 (0.0008)     0.5745 (0.0010)          0.9503 (0.0011)
A_Korean                             0.4043 (0.0019)     0.5090 (0.0022)          1.2703 (0.0021)
Vehicle type (car as base)
A_MAV                                0.0322 (0.0009)     0.0064 (0.0011)          0.2076 (0.0016)
A_mini car                           0.5957 (0.0025)     0.1938 (0.0031)          0.9573 (0.0029)
A_minivan                            0.1212 (0.0016)     0.1712 (0.0020)          0.2620 (0.0033)
A_pickup                             0.5503 (0.0013)     0.5211 (0.0017)          0.4199 (0.0036)
A_SUV                                0.4696 (0.0018)     0.5090 (0.0019)          0.0522 (0.0024)
A_HEV                                2.8948 (0.0066)     2.1273 (0.0082)          1.8395 (0.0085)
A_HEV × S_fuel price                 0.6814 (0.0024)     0.7041 (0.0030)          0.7308 (0.0032)
A_footprint                          1.4403 (0.0042)     0.8816 (0.0049)          1.6733 (0.0060)
A_footprint × S_children             0.8437 (0.0018)     1.0015 (0.0023)          0.8928 (0.0030)
A_MPG                                0.1616 (0.0001)     0.0496 (0.0001)          0.0368 (0.0001)
A_horsepower                         0.0036 (0.0000)     0.0035 (0.0000)          0.0032 (0.0000)
Note: All estimates are statistically significant at the 0.01 level.

Model 1 formulates a universal choice set model where the choice set is composed of all 217 vehicle alternatives in the NHTS data. In Model 2, three product alternatives are randomly drawn from the other 216 vehicles in addition to the chosen vehicle to form a choice set of size four. In Model 3, individual choice sets are predicted using the proposed approach, based on the association relations and customer preferences in the JDPA data. As noted, all model estimates are statistically significant for the three MNL models.

As observed, the estimated coefficients in Model 2 with random choice sets look similar to those from Model 1 with the universal choice set, with only one exception on MAV. This is not surprising, because based on the uniform conditioning property [17], a small random set of alternatives drawn from the universal set will produce consistent parameter estimates, as long as the estimated choice model meets the independence of irrelevant alternatives property. Compared to Model 1, we also observed increased standard errors for coefficients in Model 2.

As we focus on the individual predicted choice sets, Model 3 provides the highest goodness-of-fit (0.1532) with coefficient estimates being quite different from the other two models. Noticeable declines are found in the coefficients on Korean, MAV, Minicar, and Pickup, and increases in price/income, SUV, and HEV. Intuitively, the increase of the coefficient for price/income can be explained by the fact that customers are less concerned about the price in the final comparison if they have already screened out most of the unacceptable vehicles in consideration. The increase of the SUV coefficient suggests that customers who considered this specific type of vehicle are very determined to buy it when the choice set effect is explicitly considered. Similarly, the coefficient of HEV implies that hybrid vehicles are more attractive to those customers who actually included hybrid vehicles in their choice sets.

Through model comparisons, we conclude that choice sets do have an impact on choice model estimates. The model estimation result confirms that the proposed method has the ability to restrict choice set products to relevant alternatives when conducting discrete choice analysis with a large number of alternatives. The built choice models can be used further to support engineering design decisions by optimizing the vehicle attributes in the models.

6 Conclusions

In this paper, a data-driven network analysis approach is proposed to mine the existing data of choice sets and then predict heterogeneous choice sets for individual customers in new choice data where the choice set information is lacking. This network-analysis based method is particularly useful when a large set of choice alternatives exists, and the alternatives and customer profile attributes largely overlap in the existing and new data sets. A range of methods such as customer surveys and on-line behavior tracking can be used for data collection. In data mining, complex product relations in choice sets are modeled as a large product association network, in which link analysis and community detection are applied to extract a set of product association rules. Customer systematic heterogeneity is analyzed using market segmentation by developing homogeneous customer profile clusters. In prediction, individual-specific choice sets are probabilistically generated from a sampling distribution which unites product association strength and communities, customer segments, and product consideration frequencies. Since the sampling distribution for choice set formation is probabilistic, random heterogeneity among customer consideration sets is thereby taken into account. Two examples are provided to illustrate and test the proposed approach. In both examples, the results of predicted choice sets are used to estimate MNL models.

Our results from the two examples, one using synthetic data and the other using real vehicle data, consistently indicate that the proposed approach leads to promising results. In the first example, we apply the approach to a simulated dataset with choice set data generated by a priori consideration rules. As the true choice utility function is prespecified in the simulation, we quantify the choice model bias as a result of the misspecification of choice sets. By comparing three choice models with different choice sets under two noise scenarios, our simulation results confirm that choice sets have a significant impact on the estimated model coefficients, and the parameter bias of the estimated choice model is significantly reduced with the predicted choice sets using our approach. In the second example with actual vehicle data, we utilize network topological properties to test the similarities of the association relations identified. We then evaluate an aggregated precision metric of product associations using QAP permutations and an individual-level precision metric for out-of-sample validity. Finally, we demonstrate the process of transferring the data mining results from the existing choice set data to a new dataset for prediction. The model estimation result strengthens the point that the prediction of individual choice sets is useful for modeling compensatory decisions on customer choice when a very large number of product alternatives exist.

Our research also highlights the benefits and potential of using network techniques for understanding customer preferences in product design. In particular, network analysis helps not only in capturing implicit product relations but also in visualizing association rules in an organized network graph. As a result, characteristics of choice sets are summarized based on network topological features rather than on an a priori or intuitive basis of knowledge.

The ultimate goal of improving choice model estimation in engineering design is to improve the quality of design decisions. Compared to an MNL model with the entire set of products or a small random set of products as choice sets, the choice models built using the predicted choice sets as input are more trustworthy. The obtained choice models for the vehicle problem demonstrate the potential of using such models to support engineering design decisions as vehicle attributes are explicitly included as model inputs. More vehicle engineering design attributes can be introduced into the choice model in addition to those included in this paper. Comparison with methods that employ heuristic rules for choice set prediction can also be considered. The derived information of product associations is not only useful for choice modeling, but can also be used to derive market segments in designing products such as product families. To further evaluate the performance of the proposed approach, different forms and structures of choice models will be tested and used in design applications.

Acknowledgment

Partial support to Mingxian Wang's research from the Design Cluster fellowship at Northwestern University is greatly appreciated. We are also grateful for the support from the National Science Foundation, Engineering Design program, CMMI-1436658.

Nomenclature

$A_{nk}$ = customer-desired product attributes of product k by customer n
$C_k$ = network community for product k
$CF_{nk}(k')$ = co-consideration frequency of a product k' for customer n who purchased product k
$CP_{nk}(k')$ = customer-specific product consideration probability of a product k' for customer n who purchased product k
$CS_n$ = choice set for individual customer n
$F_{G_n}(k')$ = relative consideration frequency for a product k' by customers in cluster $G_n$
$G_n$ = customer cluster for individual customer $n$
HEV = hybrid electric vehicle
JDPA = J.D. Power vehicle survey
$L_{ij}$ = network link strength between product $i$ and product $j$
MAPE = mean absolute percentage error
MAV = multi-activity vehicle
MNL = multinomial logit
NHTS = National Household Travel Survey
QAP = quadratic assignment procedure
$S_n$ = customer profile attributes
$U_{nk}$ = customer choice utility of alternative $k$ by customer $n$
$W_{nk}$ = observed part of the customer choice utility of alternative $k$ by customer $n$
$\beta$ = discrete choice model parameter in customer's utility function
$\varepsilon_{nk}$ = unobserved part of the customer choice utility of alternative $k$ by customer $n$
$\Phi^{-1}$ = quantile function of the standard normal distribution
References Part B: Methodol., 39(7), pp. 641–657.
[1] Chen, W., Hoyle, C., and Wassenaar, H. J., 2013, Decision-Based Design: Inte- [32] Dieckmann, A., Dippold, K., and Dietrich, H., 2009, “Compensatory Versus
grating Consumer Preferences Into Engineering Design, Springer, London. Noncompensatory Models for Predicting Consumer Preferences,” Judgment
[2] Kumar, D., Chen, W., and Simpson, T. W., 2009, “A Market-Driven Approach Decis. Making, 4(3), pp. 200–213.
to Product Family Design,” Int. J. Prod. Res., 47(1), pp. 71–104. [33] Martınez, F., Aguila, F., and Hurtubia, R., 2009, “The Constrained Multinomial
[3] Hoyle, C., Chen, W., Wang, N., and Koppelman, F. S., 2010, “Integrated Logit: A Semi-Compensatory Choice Model,” Transp. Res. Part B: Methodol.,
Bayesian Hierarchical Choice Modeling to Capture Heterogeneous Consumer 43(3), pp. 365–377.
Preferences in Engineering Design,” ASME J. Mech. Des., 132(12), p. 121010. [34] Silva-Risso, J., and Ionova, I., 2008, “Practice Prize Winner-A Nested Logit
[4] He, L., Chen, W., Hoyle, C., and Yannou, B., 2012, “Choice Modeling for Model of Product and Transaction-Type Choice for Planning Automakers’ Pric-
Usage Context-Based Design,” ASME J. Mech. Des., 134(3), p. 031007. ing and Promotions,” Mark. Sci., 27(4), pp. 545–566.
[5] He, L., Wang, M., Chen, W., and Conzelmann, G., 2014, “Incorporating Social [35] Wasserman, S., and Faust, K., 1994, Social Network Analysis: Methods and
Impact on New Product Adoption in Choice Modeling: A Case Study in Green Applications, Cambridge University, Cambridge.
Vehicles,” Transp. Res. Part D: Transp. Environ., 32(2014), pp. 421–434. [36] Sosa, M., Mihm, J., and Browning, T., 2011, “Degree Distribution and Quality
[6] Kim, H. M., Kumar, D. K., Chen, W., and Papalambros, P. Y., 2006, “Target in Complex Engineered Systems,” ASME J. Mech. Des., 133(10), p. 101008.
Exploration for Disconnected Feasible Regions in Enterprise-Driven Multilevel [37] Sosa, M. E., Eppinger, S. D., and Rowles, C. M., 2007, “A Network Approach
Product Design,” AIAA J., 44(1), pp. 67–77. to Define Modularity of Components in Complex Products,” ASME J. Mech.
[7] Michalek, J. J., Feinberg, F. M., and Papalambros, P. Y., 2005, “Linking Mar- Des., 129(11), pp. 1118–1129.
keting and Engineering Product Design Decisions Via Analytical Target [38] Netzer, O., Feldman, R., Goldenberg, J., and Fresko, M., 2012, “Mine Your
Cascading*,” J. Prod. Innovation Manage., 22(1), pp. 42–62. Own Business: Market-Structure Surveillance Through Text Mining,” Mark.
[8] Shiau, C.-S. N., and Michalek, J. J., 2009, “Should Designers Worry About Sci., 31(3), pp. 521–543.
Market Systems?,” ASME J. Mech. Des., 131(1), p. 011011. [39] Raeder, T., and Chawla, N. V., 2011, “Market Basket Analysis With
[9] Morrow, W. R., Long, M., and MacDonald, E. F., 2014, “Market-System Networks,” Social Network Anal. Min., 1(2), pp. 97–113.
Design Optimization With Consider-Then-Choose Models,” ASME J. Mech. [40] Moe, W. W., 2006, “An Empirical Two-Stage Choice Model With Varying
Des., 136(3), p. 031003. Decision Rules Applied to Internet Clickstream Data,” J. Mark. Res., 43(4),
[10] Resende, C. B., Heckmann, C. G., and Michalek, J. J., 2012, “Robust Design for pp. 680–692.
Profit Maximization With Aversion to Downside Risk From Parametric Uncer- [41] Newman, M. E., and Girvan, M., 2004, “Finding and Evaluating Community
tainty in Consumer Choice Models,” ASME J. Mech. Des., 134(10), p. 100901. Structure in Networks,” Phys. Rev. E, 69(2), p. 026113.
[11] Train, K. E., 2009, Discrete Choice Methods With Simulation, Cambridge [42] Weinstein, A., 1994, Market Segmentation: Using Demographics, Psycho-
University. graphics and Other Niche Marketing Techniques to Predict and Model Cus-
[12] Ben-Akiva, M. E., and Lerman, S. R., 1985, Discrete Choice Analysis: Theory tomer Behavior, Probus Publishing Company, Chicago.
and Application to Travel Demand, MIT, Cambridge. [43] Eshghi, A., Haughton, D., Legrand, P., Skaletsky, M., and Woolford, S., 2011,
[13] U.S. Department of Transportation, F. H. A., 2009, “National Household Travel “Identifying Groups: A Comparison of Methodologies,” J. Data Sci., 9, pp.
Survey,” http://nhts.ornl.gov 271–291.
[14] Hauser, J. R., and Wernerfelt, B., 1990, “An Evaluation Cost Model of Consid- [44] Lemp, J. D., and Kockelman, K. M., 2012, “Strategic Sampling for Large
eration Sets,” J. Consum. Res., 16(4), pp. 393–408. Choice Sets in Estimation and Application,” Transp. Res. Part A: Policy Pract.,
[15] Hauser, J. R., Ding, M., and Gaskin, S. P., 2009, “Non-Compensatory (and 46(3), pp. 602–613.
Compensatory) Models of Consideration-Set Decisions,” Sawtooth Software [45] Hyndman, R. J., and Koehler, A. B., 2006, “Another Look at Measures of Fore-
Conference, Sequim. cast Accuracy,” Int. J. Forecasting, 22(4), pp. 679–688.
[16] Nerella, S., and Bhat, C. R., 2004, “Numerical Analysis of Effect of Sampling [46] Fruchterman, T. M. J., and Reingold, E. M., 1991, “Graph Drawing by Force-
of Alternatives in Discrete Choice Models,” Transp. Res. Rec.: J. Transp. Res. Directed Placement,” Software: Pract. Exp., 21(11), pp. 1129–1164.
Board, 1894(1), pp. 11–19. [47] Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P., and Bhattacharjee, B.,
[17] McFadden, D., 1978, Modelling the Choice of Residential Location, Institute of 2007, “Measurement and Analysis of Online Social Networks,” The 7th ACM
Transportation Studies, University of California, Berkeley, CA. SIGCOMM Conference on Internet Measurement, ACM, New York, pp.
[18] Peters, T., Adamowicz, W. L., and Boxall, P. C., 1995, “Influence of Choice 29–42.
Set Considerations in Modeling the Benefits From Improved Water Quality,” [48] Hartigan, J. A., and Wong, M. A., 1979, “Algorithm AS 136: A k-Means Clus-
Water Resour. Res., 31(7), pp. 1781–1787. tering Algorithm,” Appl. Stat., 28(1), pp. 100–108.
[19] Williams, H., and Ort uzar, J. D., 1982, “Behavioural Theories of Dispersion [49] Krackhardt, D., 1988, “Predicting With Networks: Nonparametric Multiple
and the Mis-Specification of Travel Demand Models,” Transp. Res. Part B: Regression Analysis of Dyadic Data,” Social Networks, 10(4), pp. 359–381.
Methodol., 16(3), pp. 167–219. [50] Bass, F. M., 1974, “The Theory of Stochastic Preference and Brand Switching,”
[20] Shocker, A. D., Ben-Akiva, M., Boccara, B., and Nedungadi, P., 1991, J. Mark. Res., pp. 1–20.
“Consideration Set Influences on Consumer Decision-Making and Choice: [51] Clauset, A., Newman, M. E., and Moore, C., 2004, “Finding Community Struc-
Issues, Models, and Suggestions,” Mark. Lett., 2(3), pp. 181–197. ture in Very Large Networks,” Phys. Rev. E, 70(6), p. 066111.
[21] Parsons, G. R., and Kealy, M. J., 1992, “Randomly Drawn Opportunity Sets in [52] Csardi, G., and Nepusz, T., 2005, The Igraph Software Package for Complex
a Random Utility Model of Lake Recreation,” Land Econ., 68(1), pp. 93–106. Network Research, InterJournal, Complex Systems, http://igraph.org
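As an illustration of the prediction step summarized in the Conclusions, the following Python sketch shows one way a customer-specific consideration probability could be assembled from link strengths, communities, and segment-level consideration frequencies, and then used to sample a choice set. It is a minimal sketch under stated assumptions, not the procedure or code used in this work: the multiplicative combination of link strength and consideration frequency, the restriction of candidates to the purchased product's community, and all function names, variable names, and numbers are hypothetical.

```python
import random


def consideration_probabilities(purchased, link_strength, community, segment_freq):
    """Return normalized consideration probabilities for candidate products.

    ASSUMPTION: the score of a candidate is link strength times the
    segment-level consideration frequency, and only products in the same
    network community as the purchased product are candidates.
    """
    scores = {}
    for candidate, freq in segment_freq.items():
        if candidate == purchased:
            continue
        if community.get(candidate) != community.get(purchased):
            continue
        pair = frozenset((purchased, candidate))
        scores[candidate] = link_strength.get(pair, 0.0) * freq
    total = sum(scores.values())
    if total == 0:
        return {}
    return {k: s / total for k, s in scores.items()}


def sample_choice_set(purchased, probs, size, rng):
    """Draw the purchased product plus up to size - 1 alternatives,
    sampled without replacement in proportion to their probabilities."""
    remaining = {k: p for k, p in probs.items() if p > 0}
    chosen = [purchased]
    while remaining and len(chosen) < size:
        products = list(remaining)
        weights = [remaining[k] for k in products]
        pick = rng.choices(products, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Hypothetical toy data: four products, two communities, one customer segment.
link_strength = {
    frozenset(("A", "B")): 0.6,
    frozenset(("A", "C")): 0.2,
    frozenset(("A", "D")): 0.9,
}
community = {"A": 1, "B": 1, "C": 1, "D": 2}
segment_freq = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}

probs = consideration_probabilities("A", link_strength, community, segment_freq)
choice_set = sample_choice_set("A", probs, size=3, rng=random.Random(0))
print(probs)
print(choice_set)
```

Because the alternatives are drawn at random rather than selected by a fixed rule, two customers in the same segment who purchased the same product can still receive different choice sets, which is the random heterogeneity noted above. Product D is never sampled in this toy example only because of the assumed community restriction.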