All content following this page was uploaded by Mingxian Wang on 18 January 2016.
Keywords: choice set, choice modeling, customer preference, product association, data analytics, network analysis
1 Introduction

Customer choice modeling is gaining increasing attention in engineering design because it allows prediction of future product demand as a function of engineering design attributes and the target market description [1]. Recent efforts toward incorporating customer preferences into engineering design have addressed a wide spectrum of design interests, such as platform-based product family design [2], hierarchical systems design [3], usage and social context based design [4,5], multilevel and multidisciplinary design [6,7], design under market competition [8], the consider-then-choose model in design optimization [9], and robust design and sensitivity evaluation [10]. Although all of the aforementioned studies involve the modeling of customer choices, practical implementation is challenged by the availability of choice set data [11,12].

A choice set, in the context of engineering design, is defined as "a set of product alternatives available to an individual who will seriously evaluate through comparisons before making a final choice." Choice set information is often unavailable in collected market data, yet it is important for building choice models. For example, for vehicle choice modeling, the National Household Travel Survey (NHTS) is a rich set of data that records vehicle ownership, demographics of households, and detailed information on customer travel behaviors in the United States [13]; however, information on individuals' vehicle choice sets was not collected. The common approach has been either to use all vehicles in the market as the choice set or to randomly pick a few vehicles² to form individual choice sets [16,17]. While both treatments may be reasonable in certain contexts [18], studies show that misspecification of choice sets may result in inferior choice model estimates, especially when a large set of choice alternatives exists [19,20]. By contrast, it has been shown that good predictions of reduced choice sets can better approximate individual choice behaviors and thus yield improved modeling results [21].

In this paper, we propose a data-driven network analysis approach to predict individual choice sets for choice data with missing choice set information. Given a customer's profile and the final choice made, an individual's choice set is predicted based on existing choice set data for a similar product. In the aforementioned vehicle example, choice set information gathered from a different source, e.g., the J.D. Power vehicle survey (JDPA), can be used for learning and then predicting choice sets so that the NHTS data can be used for choice modeling. The key idea of our approach is to construct a customer association network of products using existing choice set survey data. Network links and their strengths reflect the proximity or similarity of two products in customers' perceptual space, indicating which product alternatives are more likely to be considered together by a customer. This line of reasoning is supported by a rich literature in cognitive psychology [22,23] and is similar to the co-occurrence analysis used in mining association rules [24,25]. Taking vehicles as an example, if the strength of the network link between "Toyota Camry" and "Honda Civic" is larger than that between Toyota Camry and "BMW 7,"

¹Corresponding author.
Contributed by the Design Automation Committee of ASME for publication in the JOURNAL OF MECHANICAL DESIGN. Manuscript received September 9, 2014; final manuscript received March 10, 2015; published online May 19, 2015. Assoc. Editor: Bernard Yannou.
²Studies show that the number of vehicles a customer seriously considers is often in the range of 3-6 [14,15].
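The co-occurrence reasoning above can be made concrete with a lift measure over choice-set survey data: pairs of products that appear together in individual choice sets more often than chance form links in the association network, and pairs with lift at or below 1 are pruned (as the paper later does in Sec. 4.3). The sketch below is a minimal illustration under that idea; the product names and data are illustrative, not the paper's actual survey data.

```python
from itertools import combinations
from collections import Counter

def lift_network(choice_sets):
    """Build product-pair lift values from a list of choice sets.

    lift(i, j) = P(i and j co-considered) / (P(i) * P(j));
    pairs with lift <= 1 co-occur no more often than chance and are pruned.
    """
    n = len(choice_sets)
    occur = Counter()      # how many choice sets contain product k
    co_occur = Counter()   # how many choice sets contain both i and j
    for cs in choice_sets:
        items = sorted(set(cs))
        occur.update(items)
        co_occur.update(combinations(items, 2))

    links = {}
    for (i, j), nij in co_occur.items():
        lift = (nij / n) / ((occur[i] / n) * (occur[j] / n))
        if lift > 1.0:     # keep only stronger-than-chance associations
            links[(i, j)] = lift
    return links

# Toy data: three respondents' choice sets
sets = [{"Camry", "Accord"}, {"Camry", "Accord", "Altima"}, {"BMW7"}]
net = lift_network(sets)
```

Here "BMW7" never co-occurs with another model, so it remains an isolated node, mirroring how rarely co-considered vehicles become isolates in the paper's network.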
alternatives and customer demographics S. When the prediction data only contains a subset of the alternatives that appear in the mining choice set data, the group consideration frequency N_k'|Gn for the irrelevant products should be set to zero to avoid their being sampled in prediction. When a new product appears in the new data but is not found in the existing choice set data, the new product is assumed to be independent of other products and treated as an isolated node in the association network. Information on the current market share of the new product is used to describe the sampling distribution. Specifically, one can use the choice frequency in the prediction data to approximate N_k'|Gn in Eq. (6) and compute CP_nk in Eq. (9). An illustrative example using two different sets of market data is provided in Sec. 5.4. Similarly, for clustering customers based on demographic attributes S, if the attributes in the two datasets do not completely overlap, only the shared attributes are used in clustering. In the extreme case where no customer information is available, one can still implement the approach by treating all customers as a homogeneous group, as shown by our simulation study in Sec. 4.

4 Case Study of Synthetic Data

The goal of our synthetic data study is to evaluate the improvement of choice modeling using the proposed approach for choice set prediction. Using the synthetic choice set data and the predefined choice model parameters, we are able to measure the estimation bias of the model parameters as a result of choice set misspecifications under different noise scenarios.

4.1 Generation of Choice Set Data. We generate a population of 10,000 customers who use different consideration rules over a total of 100 product alternatives. Product alternatives are described by four explanatory variables (A1-A4), three of which (A1, A2, and A3) are continuous variables while the remaining one (A4) is a binary variable. The literature shows that customers often develop heuristic decision rules to select products in forming their choice sets [27]. Five hypothetical decision rules covering all four product attributes are created for choice set formation. Using vehicles as an example, rules can be associated with price and fuel economy. As shown in Table 1, we define Rule 1 and Rule 3 by setting lower thresholds on attributes A1 and A3. Rule 2 and Rule 4 are associated with upper thresholds on A2 and A3. The heterogeneity of a decision rule among customers is modeled by the threshold value, which is randomly sampled from a bounded uniform distribution whose range is defined by the attribute levels of existing products. For example, the bounded uniform distribution for threshold_2^H in Rule 2 guarantees that the sampled threshold value is higher than at least 40% of low-end products (U^-1(0.4)) and lower than at least 10% of available products (U^-1(0.9)) in the data, where U^-1(·) stands for the inverse of the standardized distribution function. In generating the synthetic data, we also take into account the fact that not all customers will consider all rules. The fraction of customers who consider a particular rule is shown in Table 1. A customer may consider all five rules or none of the rules listed. The simulated data have an average choice set size of 13, with 60 of the 100 alternatives considered one or more times.

4.2 Choice Model Specification and Choice Response Simulation. Given the simulated "true" choice sets described above, the purchase decision is made based on the random utility defined in Eq. (1). The prespecified utility function for this case study is shown in Eq. (10), where the systematic component W_nk is a linear sum of the four product attributes A1-A4. The error component ε_nk for all k ∈ CS_n is randomly drawn from a Gumbel distribution with a variance of π²/6:

W_nk = β1 A_k1 + β2 A_k2 + β3 A_k3 + β4 A_k4, if k ∈ CS_n   (10)

We study two levels of model noise (error component) and their effect on the performance of the proposed approach. Following the method in Ref. [44], two scenarios are examined by adjusting the model coefficients in the utility function. In the high noise scenario, the parameter coefficients on the explanatory variables are assumed to be (1, 1, 1, 1). By holding the random component's error term constant, a "low noise" scenario is created by doubling the model coefficients to (2, 2, 2, 2), which relatively decreases the contribution of the random component to the overall utility. Customer attributes S_n are omitted from the utility function for simplicity. The choice response for each individual customer is determined by choosing the product alternative that maximizes the individual utility U_nk.

4.3 Mining the Simulated Choice Sets Data. Using the simulated choice set and choice response data, a training set and a testing set are formed: the first half (5000 choice observations), together with the corresponding true choice set data, is used to learn product associations and customer consideration frequencies. Predictions of choice sets are carried out for the second half (5000 choice observations), where the choice sets are assumed to be unknown. Using the lift association analysis introduced in Sec. 2.2, 1626 pairs of relations are identified, among which 285 (17.53%) links with a value below 1 are removed from the network to reduce the noise embedded in the data. To identify product communities, Newman's optimal modularity method [41] is employed and solved using a greedy algorithm [51] in the R environment with the igraph package [52]. The algorithm identifies three product communities and one isolated node within the 60 considered product alternatives. Since customer attributes S_n are not considered in this choice model, all customers are treated identically and combined into one single cluster.

4.4 Prediction of Choice Sets for Simulated Data. Following the approach described in Sec. 3.2, the choice set sampling process proceeds by computing the customer-specific product consideration probabilities CP_nk(k') in Eqs. (6)-(9). Once the sampling distribution is determined, alternative products are drawn sequentially to generate the predicted choice sets. The number of sampled products for each choice set is fixed at 12, which is determined by the average choice set size from the training data. The process is repeated for every customer in the test data.

4.5 Estimation of Parameter Bias for Choice Models. To assess the effectiveness of the predicted choice sets for improving
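The random-utility simulation of Sec. 4.2 (Eq. (10)) can be sketched as follows. This is a minimal sketch, not the paper's code: the attribute values, seed, and example choice set are illustrative, while the Gumbel error (scale 1, variance π²/6), the linear systematic utility, and the (1, 1, 1, 1) versus (2, 2, 2, 2) coefficient scenarios follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_choice(attrs, choice_set, beta, rng):
    """Pick the alternative in choice_set that maximizes
    U_nk = W_nk + eps_nk, with W_nk a linear sum of attributes
    (Eq. (10)) and eps_nk ~ Gumbel(0, 1), variance pi^2/6."""
    k = np.asarray(choice_set)
    W = attrs[k] @ beta                               # systematic component
    eps = rng.gumbel(loc=0.0, scale=1.0, size=len(k)) # random component
    return k[np.argmax(W + eps)]

# 100 hypothetical alternatives: A1-A3 continuous, A4 binary
attrs = np.column_stack([rng.uniform(0, 1, (100, 3)),
                         rng.integers(0, 2, 100)])
beta_high = np.array([1.0, 1.0, 1.0, 1.0])  # "high noise" scenario
beta_low = 2.0 * beta_high                  # "low noise": doubled coefficients
chosen = simulate_choice(attrs, [3, 17, 42, 65], beta_high, rng)
```

Doubling the coefficients leaves the error variance unchanged, so the systematic part dominates more, which is exactly how the low-noise scenario is constructed above.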
the average product is considered infrequently with the majority of its neighbors, and frequently with only a few in the market. To better understand the structural implications of a product association network, we compared the products with high centralities to the products with high occurrence frequency in choice sets. The overlap rate of the top-30 vehicles in terms of node degree (centrality) and occurrence frequency is 50.0%. This indicates that the products that are widely considered are not necessarily the most popular alternatives in choice sets. The most compared-against vehicle model is "Nissan Altima," which has been compared at least once to 117 other vehicle models. Toyota Camry and "Honda Accord" are the top two vehicle models in terms of frequency of occurrence in choice sets. "Toyota FJ Cruiser" and "Toyota Avalon" are two of the top-10 vehicle models in degree centrality, but they are not among the top-30 most considered vehicles based on the training set of the JDPA data.

Figure 6(a) provides a visualization of the processed network, where different colors stand for the multiple vehicle communities identified. The layout of the network is optimized using the Fruchterman-Reingold algorithm [46], where nodes are positioned in an aesthetically pleasing way, although distances between nodes are not inversely proportional to link strength. The network comprises a giant interconnected component and 17 isolated nodes. The presence of isolated vehicle alternatives is caused by their low frequency of occurrence in choice sets and their random association patterns to other vehicles. As noted, this vehicle association network has a remarkably low density (0.075), while other network statistics are comparable to a typical social network of the same size [47].

The above analysis presents the advantages of using network techniques to simultaneously measure the associations of 262 car models based on the customer data. Beyond traditional association analysis, the network technique allows us to analyze and visualize a large number of product alternatives, and thereby provides a better visual understanding of the positions and roles of various products in a market. Furthermore, commonalities and differences of product features within and across product communities can be further examined to draw insights into the noncompensatory criteria customers use and to guide product design.

5.1.2 Vehicle Community Structure in a Weighted Network. To further explore the vehicle association structures, we inspect the communities of vehicles considered frequently together within a group but less frequently with other groups of vehicle models. Six communities are identified within the large interconnected component, shown in Fig. 6(a) with different colors. The community size varies from 10 vehicles (3.8% of total vehicles) to 78 vehicles (29.8% of total vehicles). The types of vehicles falling into each identified community are listed in Table 2. For example, community C-1 contains 56 vehicles, among which 39 are passenger cars (CAR) and 17 are sport utility vehicles (SUV). By contrast, multiactivity vehicles (MAV) dominate community C-2 over other types of vehicles. This implies that the grouped vehicle communities show a highly organized structure in terms of vehicle types, but sometimes customers do consider different types of vehicles together. Similar analysis can be performed on other vehicle features to study their importance in customers' consideration processes.
5.1.3 Customer Segments. In customer clustering, seven cus-
tomer profile attributes S, including five sociodemographic attrib-
utes and two usage context attributes, are extracted for each
customer from the training data. Note that “Gender” is coded as 1
for male and 2 for female, and all other attributes are coded by
their numeric levels. All of the variables are standardized (centered to zero and scaled by dividing by their standard deviations) before clustering. The K-means algorithm [48], a basic unsupervised
clustering method, is used for customer clustering. Eight clusters
are identified that achieve a good balance between the classifica-
tion quality and the model efficiency.
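The clustering step above can be sketched as follows. This is a minimal sketch with synthetic profile data: the standardization and the eight K-means clusters match the description, but the data matrix and random seed are illustrative, not the paper's survey data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the seven customer profile attributes S
# (five sociodemographic + two usage context attributes)
S = rng.normal(size=(200, 7))

# Center each attribute to zero and scale to unit standard deviation
S_std = StandardScaler().fit_transform(S)

# Eight clusters, as identified in the paper
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(S_std)

labels = km.labels_            # cluster assignment for each customer
centers = km.cluster_centers_  # cluster means of the standardized attributes
```

The `centers` array corresponds to the kind of per-cluster mean profile reported in Table 3.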
Table 3 shows the cluster size and the cluster mean values of
the standardized customer profile variables S. The identified clus-
tering pattern shows both the customer heterogeneity and the cor-
relation between different profile attributes. For example,
customers in cluster G-5 are older people with small family size,
while customers in cluster G-6 tend to have higher income and
higher education level. To extract more useful patterns from the customer cluster data, other statistical techniques, such as post-clustering methods and discriminant analysis techniques, can be performed to identify key customer characteristics that differentiate the clusters.

Fig. 6 Vehicle association network with colored communities and structural properties: (a) network derived using training choice sets and (b) network derived using predicted choice sets
Table 2 Vehicle types within each identified community

Choice set community no.   Size   % of total vehicles   CAR   MAV   MINI CAR   MINIVAN   PICKUP   SUV
C-1                          56   21.4                   39     0          0         0        0    17
C-2                          78   29.8                    7    69          0         0        2     0
C-3                          48   18.3                   30    11          0         0        0     7
C-4                          42   16.0                   22     3          9         1        0     7
C-5                          10    3.8                    0     0          0        10        0     0
C-6                          11    4.2                    0     0          0         0       11     0
Isolates                     17    6.5                    5     5          0         1        4     2
Total                       262  100.0                  103    88          9        12       17    33
Table 3 (header only; cluster rows not recovered): Customer cluster no., size, S_gender, S_age, S_income, S_children#, S_education, S_local/hwy usage, S_miles driven daily
Table 4 Estimates from multiple MNL models with different choice set specifications using validation data
(three estimate columns, one per choice set specification; column labels not recovered)

Attributes
A_price/S_income                   0.1921 (0.0004)   0.1998 (0.0005)   0.1183 (0.0007)
Vehicle origin (domestic as base)
A_European                         0.2396 (0.0016)   0.0584 (0.0018)   0.0952 (0.0020)
A_Japanese                         0.6510 (0.0008)   0.5745 (0.0010)   0.9503 (0.0011)
A_Korean                           0.4043 (0.0019)   0.5090 (0.0022)   1.2703 (0.0021)
Vehicle type (car as base)
A_MAV                              0.0322 (0.0009)   0.0064 (0.0011)   0.2076 (0.0016)
A_mini car                         0.5957 (0.0025)   0.1938 (0.0031)   0.9573 (0.0029)
A_minivan                          0.1212 (0.0016)   0.1712 (0.0020)   0.2620 (0.0033)
A_pickup                           0.5503 (0.0013)   0.5211 (0.0017)   0.4199 (0.0036)
A_SUV                              0.4696 (0.0018)   0.5090 (0.0019)   0.0522 (0.0024)
A_HEV                              2.8948 (0.0066)   2.1273 (0.0082)   1.8395 (0.0085)
A_HEV x S_fuel price               0.6814 (0.0024)   0.7041 (0.0030)   0.7308 (0.0032)
A_footprint                        1.4403 (0.0042)   0.8816 (0.0049)   1.6733 (0.0060)
A_footprint x S_children           0.8437 (0.0018)   1.0015 (0.0023)   0.8928 (0.0030)
A_MPG                              0.1616 (0.0001)   0.0496 (0.0001)   0.0368 (0.0001)
A_horsepower                       0.0036 (0.0000)   0.0035 (0.0000)   0.0032 (0.0000)
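Under the random-utility model with Gumbel errors used throughout the paper, coefficients such as those in Table 4 enter the standard multinomial logit (MNL) choice probability P_nk = exp(W_nk) / sum over j in CS_n of exp(W_nj). A minimal sketch of that computation (the utility values are hypothetical, not taken from Table 4):

```python
import numpy as np

def mnl_probs(W):
    """Multinomial logit choice probabilities over a choice set:
    P_nk = exp(W_nk) / sum_j exp(W_nj), the closed form implied by
    i.i.d. Gumbel-distributed error components."""
    W = np.asarray(W, dtype=float)
    e = np.exp(W - W.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical systematic utilities for three alternatives in one choice set
p = mnl_probs([1.0, 0.5, -0.2])
```

Because the probabilities are normalized within the choice set, a misspecified choice set changes every P_nk, which is why choice set specification drives the estimate differences across Table 4's columns.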