
Journal of Choice Modelling 31 (2019) 198–209


Airline itinerary choice modeling using machine learning


Alix Lheritier*, Michael Bocamazo, Thierry Delahaye, Rodrigo Acuna-Agost
Innovation and Research Division, Amadeus S.A.S., 485 Route du Pin Montard, 06902 Sophia Antipolis Cedex, France

A R T I C L E  I N F O

Keywords:
Multinomial logit model
Latent class multinomial logit model
Machine learning
Decision tree
Random forest

A B S T R A C T

Understanding how customers choose between different itineraries when searching for flights is very important for the travel industry. This knowledge can help travel providers, either airlines or travel agents, to better adapt their offer to market conditions and customer needs. This has a particular importance for pricing and ranking suggestions to travelers when searching for flights. This problem has been historically handled using Multinomial Logit (MNL) models. While MNL models offer the dual advantage of simplicity and readability, they lack flexibility to handle collinear attributes and correlations between alternatives. Additionally, they require expert knowledge to introduce non-linearity in the effect of alternatives' attributes and to model individual heterogeneity. In this work, we present an alternative modeling approach based on non-parametric machine learning (ML) that is able to automatically segment the travelers and to take into account non-linear relationships within attributes of alternatives and characteristics of the decision maker. We test the models on a dataset consisting of flight searches and bookings on European markets. The experiments show our approach outperforming the standard and the latent class Multinomial Logit model in terms of accuracy and computation time, with less modeling effort.

1. Introduction

There is a growing interest within the travel industry in better understanding how customers choose between different itineraries
when searching for flights. Such an understanding can help travel providers, either airlines or travel agents, to better adapt their offer to
market conditions and customer needs, thus increasing their revenue. Such models can be used for filtering alternatives, sorting them, or even changing some attributes in real time (e.g., the price).
The field of customer choice modeling is dominated by traditional statistical approaches, such as the Multinomial Logit (MNL)
model, that are linear with respect to attributes and are tightly bound to their assumptions about the error distribution. While these
models offer the dual advantage of simplicity and readability, they lack flexibility to handle collinear attributes and correlations be-
tween alternatives. A large part of the existing modeling work focuses on adapting these modeled distributions, so that they can match
observed behavior. Nested (NL) and Cross-Nested (CNL) Logit models are good examples of this: they add terms for highly specific
attribute interactions, so that substitution patterns between sub-groups of alternatives can be captured. Another strong limitation is their
inability to model different behavior according to individual specific variables. Latent Class MNL models (Greene and Hensher, 2003) extend MNL to take into account different individual segments, whose number must be specified in advance. In these models, class membership is also represented with a Multinomial Logit and therefore suffers from the same drawbacks mentioned above.
In our particular problem, there are two main segments to be taken into account. Business and leisure air passengers behave very

* Corresponding author.
E-mail addresses: alix.lheritier@amadeus.com (A. Lheritier), michael.bocamazo@amadeus.com (M. Bocamazo), thierry.delahaye@amadeus.com (T. Delahaye),
rodrigo.acunaagost@amadeus.com (R. Acuna-Agost).

http://dx.doi.org/10.1016/j.jocm.2018.02.002
Received 29 May 2017; Received in revised form 16 February 2018; Accepted 16 February 2018
Available online 16 March 2018
1755-5345/© 2018 Elsevier Ltd. All rights reserved.

differently when it comes to booking flights. Business passengers tend to favor alternatives with convenient schedules, such as shorter connection times and departure times that match their preferences. Leisure passengers, on the other hand, are very price sensitive and will accept a longer connection time if it is reflected in a lower ticket price. The difficulty is that the customer segment is not explicitly known when the traveler is searching; it can, however, be inferred by combining different factors. For example, industry experts know that business passengers tend to book with shorter lead times and are less inclined to stay over Saturday nights. These are not "black or white" rules, which reinforces the need for a model able to detect such patterns from the data and actual customer behavior.
In this work, we propose and evaluate a novel approach to choice modeling through the use of machine learning techniques. The
selected machine learning methods can model non-linear relationships between feature values and the target class, allow collinear
features, and have more modeling flexibility to automatically learn implicit customer segments.
Decision trees are well adapted to our problem: their bifurcations (branches) naturally partition the customers into hierarchical segments and, at the same time, capture non-linear relationships within attributes of alternatives and characteristics of the decision maker, whenever this has a positive impact on prediction accuracy. In particular, we have chosen to work with machine learning methods based on ensembles of decision trees, namely Random Forests (RF) (Breiman, 2001), which further improve performance on unseen data.
In contrast to previous applications of MNL to air itinerary choice (e.g., (Newman et al., 2016)), our dataset is much more complex. We consider round-trip alternatives, multiple markets (origin/destination pairs), different traveler profiles, and different points of sale, all together in a single data set of sessions. Thus, alternatives differ between sessions and cannot easily be labeled into a few discrete categories, in contrast to classical MNL examples such as {car, bus, air}. Moreover, the number of alternatives is not constant and can be as high as 50, and some alternatives can be highly correlated (e.g., same outbound but different inbound flights). This increases the complexity and makes the weaknesses of MNL more evident.
The rest of this paper is organized as follows: Section 2 presents a literature review of related works. Section 3 introduces the background and problem formulation for the experiments. Section 4 presents a novel approach based on machine learning. Section 5 presents numerical experiments and their results. Finally, Section 6 draws the main conclusions and presents some perspectives on this work.

2. State of the art on air itinerary choice modeling

The first work studying air itinerary choices appeared in (Proussaloglou and Koppelman, 1999). The paper develops air traveler choice models to analyze the tradeoffs decision makers face when choosing among different carriers, flights, and fare classes. This work analyzes two different segments, business and leisure, and demonstrates the higher price sensitivity of leisure travelers and the greater importance of convenient schedules to business travelers. A main weakness of this work is the lack of realistic and representative data: the authors use a combination of mail and telephone surveys to collect the data and focus on only two routes in the US.
Extensions of this work are presented in (Coldren et al., 2003; Coldren and Koppelman, 2005a, 2005b). These studies are based on
aggregate air-travel itinerary choice models estimated at the city-pair level for all city-pairs in the US market. These models determine
the attributes of the alternatives that influence the choices of itineraries. The models are estimated using aggregate multinomial logit
methodology (Coldren et al., 2003) or generalized extreme value (GEV) models (Coldren and Koppelman, 2005a). An improvement of the estimations is provided by the same authors in (Coldren and Koppelman, 2005b), consisting of a more realistic time-of-day competition dynamic modeled with ordered generalized extreme value (OGEV) and hybrid OGEV models.
More extensions of these works were presented in (Carrier, 2008). The author proposes a latent class MNL model including a segmentation based on trip characteristics. Another interesting aspect of this work is the data: the choice set is reconstituted from past bookings, flight availability snapshots, and fare rules.
Similarly, (Parker et al., 2005) evaluates the effect of stops, travel time, fare, and a number of related features. The main applications of the proposed model concern the analysis of fleet and equipment requirements. The method is based on a two-step MNL choice model and takes the total demand amount as input: the first step is used to eliminate alternatives with low shares, and the second to compute the final expected shares. The market shares are computed at an aggregate level. Along the same lines, (Bhadra and Hogan, 2005) presents an MNL for predicting the share of itineraries with a given number of stops (direct, one, or two stops) on a given city pair, depending on factors such as the distance, the average fare, and the presence of full service carriers (FSC) or low cost carriers (LCC). Neither of these two approaches looks at individual itineraries; both work at an aggregate level.
The recent works (Lurkin, 2017; Lurkin et al., 2017) estimate airline itinerary choice models (MNL, NL, and OGEV models) on continental U.S. flights flown in May 2013, at an aggregate level. They propose a correction for price endogeneity using a control function based on several types of instrumental variables.
These previous works share common weaknesses: they rely on aggregation and oversimplification (e.g., mostly one-way trips on very few routes). In (Delahaye et al., 2017), the authors overcome these issues with a two-stage approach to predict travelers' choice behavior by combining machine learning and discrete choice modeling techniques, using data obtained by merging search logs and real bookings on several European markets. The authors report significant gains in choice prediction accuracy compared to a single-stage (e.g., MNL) procedure.

3. Problem formulation and background

In this section, we formulate the choice modeling problem and present two classical tools to solve it.


3.1. Problem formulation

Each individual $i$ considers a choice set of $n_i$ alternatives from which he must choose exactly one, with $j = 1, \ldots, n_i$ indexing each alternative. We consider that the choice sets faced by each individual can be completely different from each other. We denote by $C_i$ the random variable representing the index of the chosen alternative. The choice set verifies the three basic conditions: a) mutually exclusive, b) exhaustive, and c) composed of a finite number of alternatives.
In each choice set, two kinds of feature tuples are considered:

• $x_{i0}$, characterizing the individual (also called characteristics), and
• $x_{ij}$, characterizing the alternatives (also called attributes).

The tuples can be fully numeric or a mix of categorical and numeric types. We denote the tuple of all the features that can be considered for alternative $j$ of choice situation $i$ as $a_{ij} \equiv (x_{i0}, x_{ij})$.
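To make this notation concrete, a choice situation can be represented, for instance, as follows. This is a minimal Python sketch; the field names and values are illustrative, not taken from the dataset:

# One choice situation i: individual characteristics x_i0 shared by all
# alternatives, and per-alternative attributes x_ij (field names hypothetical).
session = {
    "x_i0": {"origin": "NCE", "destination": "LHR", "days_to_departure": 12},
    "alternatives": [                      # attributes x_ij
        {"price": 143.20, "trip_duration_min": 135, "direct": 1},
        {"price": 98.50, "trip_duration_min": 260, "direct": 0},
    ],
    "chosen": 0,                           # C_i (0-based index here)
}
# a_ij is the concatenation of the individual's characteristics and the
# alternative's attributes:
a_i1 = {**session["x_i0"], **session["alternatives"][0]}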

3.2. Multinomial logit model

The Multinomial Logit Model (MNL) is derived under the assumption that a decision maker chooses the alternative that maximizes his utility $U$. In general it is not possible to know the decision maker's utility nor its form. However, one can determine some attributes of the alternatives as faced by the passenger, in addition to some characteristics of the decision maker. For a linear-in-parameters MNL, the feature tuples need to be fully numeric and, thus, we treat them as numeric vectors. In general we consider:

• alternative specific features $x_{ij}$ with generic coefficients $\beta$,
• individual specific features $x_{i0}$ with alternative specific coefficients $\gamma_j$,
• alternative specific features $x_{ij}$ with alternative specific coefficients $\delta_j$,
• alternative specific constants $\alpha_j$.

We can now specify a model or function that relates these observed factors to the unknown individual's utility. This function is often called representative utility, and defined as:

$$V_{ij} = \alpha_j + \beta \cdot x_{ij} + \gamma_j \cdot x_{i0} + \delta_j \cdot x_{ij} \qquad (1)$$

At this point, it should be noted that there are aspects of the utility that cannot be observed nor derived from the available data, hence $V_{ij} \neq U_{ij}$. The utility can therefore be decomposed as:

$$U_{ij} = V_{ij} + \varepsilon_{ij} \qquad (2)$$

where $\varepsilon_{ij}$ encapsulates all the factors that impact the utility but are not considered in $V_{ij}$. It should be noted that we do not know $\varepsilon_{ij}$, and this is the reason why these terms are treated in the literature as random. In particular, we assume the $\varepsilon_{ij}$ are i.i.d. with an extreme value distribution. Under these assumptions, the resulting probability of choosing the $j$-th alternative is given by (McFadden, 1973)

$$P(C_i = j \mid x_{ij},\, j = 0, \ldots, n_i) = \frac{e^{V_{ij}}}{\sum_{j'=1}^{n_i} e^{V_{ij'}}} \qquad (3)$$

For our particular application, alternatives vary drastically from one session to another; for this reason, alternative specific coefficients and constants are ruled out (i.e., $\alpha_j = \gamma_j = \delta_j = 0$ for all $j$).
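As an illustration, the probability in Eq. (3) under this restricted specification (generic coefficients $\beta$ only) is a softmax over the representative utilities. The following is a minimal sketch, not the Larch implementation used in the experiments of Section 5:

import numpy as np

def mnl_choice_probabilities(X, beta):
    """X: (n_i, d) matrix of alternative attributes for one session;
    beta: generic coefficients. Returns P(C_i = j) for each alternative j."""
    V = X @ beta              # representative utilities V_ij = beta . x_ij
    V = V - V.max()           # stabilize the exponentials
    expV = np.exp(V)
    return expV / expV.sum()  # Eq. (3)

# Example with 3 alternatives and 2 features, e.g., sqrt(price) and
# log(trip duration); negative coefficients encode disutility.
X = np.array([[12.0, 5.0], [9.9, 5.6], [11.1, 5.2]])
print(mnl_choice_probabilities(X, np.array([-0.36, -5.2])))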

3.3. Latent class multinomial logit model

The latent class MNL model (Greene and Hensher, 2003) takes into account individual heterogeneity by considering different classes within which homogeneity is assumed. The number of classes $q$ is specified as a parameter. Let $Q(i)$ be the index of the class of individual $i$. Then, under this model, the probability of choosing the $j$-th alternative is given by the following mixture:

$$P(C_i = j \mid x_{ij},\, j = 0, \ldots, n_i) = \sum_{c=1}^{q} P(Q(i) = c \mid x_{i0})\, P(C_i = j \mid Q(i) = c;\, x_{ij},\, j = 1, \ldots, n_i) \qquad (4)$$

where both probabilities on the right-hand side are given by MNL models.
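A sketch of the mixture in Eq. (4), where class membership is itself an MNL on the individual's characteristics. This is illustrative only; the experiments use the LCCM package described in Section 5.1:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def lc_mnl_probabilities(X, x_i0, betas, gammas):
    """X: (n_i, d) alternative attributes; x_i0: individual characteristics;
    betas[c], gammas[c]: within-class and membership coefficients of class c."""
    membership = softmax(np.array([x_i0 @ g for g in gammas]))  # P(Q(i)=c | x_i0)
    per_class = np.stack([softmax(X @ b) for b in betas])       # P(C_i=j | Q(i)=c, .)
    return membership @ per_class                               # Eq. (4) mixture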

4. Machine learning based choice modeling

In this section, we first introduce supervised machine learning. Afterwards, we provide details on how we reformulate our choice
modeling problem as a supervised learning one. Finally, we introduce random forests (RF), which is the machine learning technique
selected for our itinerary selection problem.


4.1. Supervised learning

In supervised learning, each sample point consists of a tuple of features and a target, denoted $x_k$ and $y_k$ respectively. The goal is, after seeing a set of sample points, to predict new unseen targets given their associated features. In the soft classification problem, the targets are discrete (and usually called labels) and the goal is to estimate their conditional probability distribution $P(Y \mid X)$ while trying to minimize some measure of the expected error when predicting a new unseen label. This is in contrast to hard classification, where the goal is to provide a predicted label only.
More precisely, it is assumed that the sample points are independent realizations of a pair of random variables $(X, Y)$, taking values in some sets $\Omega$ (e.g., $\Omega \equiv \mathbb{R}^d$) and $\mathcal{A}$ (e.g., $\mathcal{A} \equiv \{0, 1\}$ for binary classification), respectively.

4.2. Choice modeling as a supervised learning problem

Our approach to choice modeling is to consider each alternative independently and treat it as in a soft classification problem, i.e., given its set of feature values, predict whether it will be chosen or not. That is, $X_k$ will be instantiated with the tuple $a_{ij}$ and $Y_k$ with $\mathbb{1}_{\{j = C_i\}}$, where $k = j + \sum_{l=1}^{i-1} n_l$. We thus train the classification models as if the pairs $(X_k, Y_k)$ were independent and identically distributed, which is not the case here, since in every choice situation one and only one alternative is chosen. We therefore treat the predicted probabilities as scores that we use to rank the alternatives.
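Concretely, the reformulation flattens each choice session into one row per alternative, repeating the individual's characteristics on every row. A sketch, assuming the session structure illustrated in Section 3.1:

import pandas as pd

def to_supervised(sessions):
    """Flatten choice sessions into one (a_ij, y) row per alternative."""
    rows = []
    for i, s in enumerate(sessions):
        for j, alt in enumerate(s["alternatives"]):
            rows.append({**s["x_i0"], **alt,               # a_ij = (x_i0, x_ij)
                         "session": i,                     # kept for ranking, not a feature
                         "chosen": int(j == s["chosen"])}) # target 1{j = C_i}
    return pd.DataFrame(rows)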

4.3. Random forests

Random Forests (RF) (Breiman, 2001) is a supervised learning algorithm consisting of a collection of tree-structured learners, i.e., decision trees. Decision trees are well known to be prone to overfitting. RF achieves better generalization by growing an ensemble of $b$ independent decision trees on random subsets of the training data and by randomizing the features made available to each tree node during training, in order to reduce the variance (see, e.g., (Hastie et al., 2009, Chap. 15) for more details). We chose to work with RF since they are relatively quick to train and do not require hyper-parameter tuning to obtain good performance (see, e.g., (Caruana et al., 2008)).
Training stage. The training data consists of a set of sample pairs $\mathcal{D}_{\text{train}} = \{(x_k, y_k)\}$. An appealing feature of RF is that the sample points $x_k$ can be tuples containing categorical and numeric values.
During training, each tree is fit using a random subset of $\mathcal{D}_{\text{train}}$. At each internal node, a binary test on one variable is performed to split the data over its child nodes. The test functions can be of two forms depending on the nature of the feature used for the split. Assume that the $m$-th feature is used for splitting and let $x[m]$ denote its value in the tuple $x$. If the $m$-th feature is numeric, the test function is of the form $x[m] > \theta$, where $\theta$ is some threshold. If the $m$-th feature is categorical with values in a set $\mathcal{A}_m$, the test function is of the form $x[m] \in S$, where $S$ is some set from the power set $2^{\mathcal{A}_m}$. The sample $x$ is sent to one of the two child nodes based on the test's outcome.
Training the classifier means selecting the most discriminative feature and binary test for each node by locally optimizing over $\theta$ or $S$, depending on the type of the feature, to minimize an impurity criterion (in our case, the entropy of the conditional distribution) on the data partition. During this procedure, only a random subset of the $d$ features (usually of size $\sqrt{d}$) is available for internal node optimization. A tree can be grown fully (i.e., until there is only one sample point in each leaf) or, to limit computational complexity, it can be restricted by a maximum depth or a minimum number of points in a leaf. Nodes where tree growth stops are called leaf nodes.
At a leaf node, for soft classification, the empirical conditional distribution of a target given the feature values that lead to the leaf is
computed.
Prediction stage. When applied to an unseen data set $\mathcal{D}_{\text{test}} = \{x_k\}$, each sample point $x_k$ is propagated through all the trees by successive application of the binary tests. In the $t$-th tree, a conditional probability estimate is obtained from the leaf $l_t(x_k)$ that the sample falls into. The final conditional probability estimate is obtained as the average across all the $b$ trees:

$$\hat{P}(y = 1 \mid x_k) = \frac{1}{b} \sum_{t=1}^{b} \hat{P}_{l_t(x_k)}(y = 1) \qquad (5)$$

where $\hat{P}_{l_t(x_k)}(\cdot)$ is the empirical probability conditioned on the leaf node $l_t(x_k)$.
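As an illustration of both stages on the flattened frame of Section 4.2, the following sketch uses scikit-learn's RandomForestClassifier as an assumed stand-in (the experiments in Section 5 use H2O's Distributed Random Forests); the class-1 probabilities of Eq. (5) serve as ranking scores:

from sklearn.ensemble import RandomForestClassifier

def fit_and_rank(train_df, test_df, feature_cols):
    # Entropy criterion matches the impurity measure described above.
    # Note: unlike H2O, sklearn requires categorical features to be encoded
    # numerically beforehand.
    rf = RandomForestClassifier(n_estimators=500, criterion="entropy")
    rf.fit(train_df[feature_cols], train_df["chosen"])
    test_df = test_df.copy()
    # Eq. (5): leaf probabilities averaged over the trees, used as scores.
    test_df["score"] = rf.predict_proba(test_df[feature_cols])[:, 1]
    return rf, test_df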

4.4. Feature importance

RF lack the coefficient readability of linear models. Nevertheless, an interesting characteristic of RF is their capacity to provide a measure of feature importance. Some features are more informative than others, or have splits in the ensemble that reduce the impurity more. Let $\mathcal{N}_t(m)$ denote the set of nodes of the $t$-th tree using the $m$-th variable to split, let $p_\eta$ be the proportion of training samples reaching some node $\eta$, and let $\Delta i(\eta)$ denote the reduction in impurity yielded by the split at $\eta$. The importance of the $m$-th variable is defined as

$$\mathrm{Imp}(m) \equiv \frac{1}{b} \sum_{t=1}^{b} \sum_{\eta \in \mathcal{N}_t(m)} p_\eta\, \Delta i(\eta) \qquad (6)$$


Table 1
Features: marked with †, alternative specific features (used in Experiments 1 and 2); marked with ‡, segment predictors (used in Experiment 2). For more details, see Sections 5.2 and 5.3.

Feature                               Type    Range/Cardinality
† Price                               Num     [77.15, 16781.52]
† Trip duration (minutes)             Num     [105, 4314]
† Stay duration (minutes)             Num     [120, 434000]
† Direct                              Bin     {0,1}
† Interline                           Bin     {0,1}
† Contains Low-Cost Carrier (LCC)     Bin     {0,1}
† Outbound departure time             Num     [0, 86400]
† Outbound arrival time               Num     [0, 86400]
‡ Stay duration (minutes) (median)    Num     [125, 433240]
‡ Days to departure (DTD) (median)    Num     [0, 343]
‡ Stay Saturday (median)              Bin     {0,1}
‡ Origin                              Cat     16
‡ Destination                         Cat     28

Then, this measure can be normalized to represent the share of each feature in informativeness, which is one way of understanding
feature importance.
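In the scikit-learn sketch of Section 4.3, this normalized measure is exposed directly as feature_importances_, so a share-of-informativeness table can be obtained as follows:

import pandas as pd

def importance_shares(rf, feature_cols):
    # sklearn's feature_importances_ implements the impurity-based measure of
    # Eq. (6), normalized so that the shares sum to one.
    return (pd.Series(rf.feature_importances_, index=feature_cols)
              .sort_values(ascending=False))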

5. Experiments and results

Two experiments were conducted to demonstrate the performance of the proposed ML-based method. Section 5.1 describes the data and the experimental setup. Section 5.2 compares MNL with ML based on alternative specific attributes only, focusing on how non-linearities are captured by both methods. Section 5.3 compares latent class MNL and ML, showing how individual heterogeneity is captured by both methods.

5.1. Experimental setup

Data. We used a dataset consisting of flight searches and bookings on a set of European origin/destination markets and airlines, extracted from global distribution system (GDS) logs. Each choice set corresponds to a search session consisting of the results of a flight search request. Each search session contains up to 50 different itineraries, one of which has been booked by the customer. In total, there are 33951 choice sessions, of which 27160 were used for training; the remaining 6791 sessions were used as a test set. A total of 13 features, both numerical and categorical, were extracted from the data (Table 1).
Setup. Standard MNL was implemented using the Larch open toolbox¹ (Newman et al., 2016), since it is fast and allows a variable number of alternatives. Latent class MNL was implemented using the LCCM Python package² (El Zarwi, 2017), which also allows a variable number of alternatives. RF were implemented using the Distributed Random Forests package from the H2O library.³ RF were trained with the default parameters provided by the library, except for the number of trees: in addition to the default value of 50, a forest of 500 trees was also trained. Note that the number of trees can go as high as desired, but at some point the gains are not worth the additional computing time (Breiman, 2001).
Metrics. The selection of metrics was driven by the industrial uses of these choice models. We consider the Top-N accuracy, which is computed as follows. The alternatives are ranked by assigned probability (or score, in the case of RF) and ties are broken randomly. A prediction is considered Top-N accurate if the rank of the chosen alternative is less than or equal to N.
It should be noted that for applications such as dynamic pricing of flight tickets, a small difference in Top-1 and Top-5 prediction accuracy can lead to a significant increase in profit (Delahaye et al., 2017). Top-15 has a particular importance for ranking the results of flight searches, since most websites show approximately 15 results per page and users usually look at the first page in more detail.
The Top-N accuracy was used to compare the MNL and RF methods, as well as a trivial reference model based on uniform probability assignment, i.e., $\hat{P}_{\text{unif}}(C_i = j) = 1/n_i$.
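A sketch of the Top-N computation on a scored test frame (using the hypothetical "score" and "chosen" columns from the earlier sketches, with random tie-breaking as described above):

import numpy as np

def top_n_accuracy(test_df, n, seed=0):
    """Fraction of sessions whose booked alternative ranks within the top N."""
    rng = np.random.default_rng(seed)
    hits = total = 0
    for _, g in test_df.groupby("session"):
        g = g.assign(tiebreak=rng.random(len(g)))          # break score ties randomly
        g = g.sort_values(["score", "tiebreak"], ascending=False)
        rank_of_chosen = int(np.argmax(g["chosen"].to_numpy())) + 1  # 1-based rank
        hits += rank_of_chosen <= n
        total += 1
    return hits / total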

5.2. Experiment 1: alternative specific features only

In this experiment, we use only alternative specific features (see Table 1) to show how the flexibility of tree models allows them to capture non-linear dependencies.
For RF, we make some attributes relative via the following transformations:

• ratioPrice ≡ Price / cheapest Price
• ratioTripDuration ≡ Trip duration / shortest Trip duration
• ratioStayDuration ≡ Stay duration / shortest Stay duration

¹ Available at https://github.com/jpn--/larch.
² Available at https://github.com/ferasz/LCCM.
³ Available at http://www.h2o.ai/.


Table 2
MNL improvement with non-linear transformations.

Step Transformation #parameters AIC

1 None 8 138274
2 Price → log 8 138189
Price → sqrt 8 138124
Price → sq 8 139012
3 Trip duration → log 8 137689
Trip duration → sqrt 8 137930
Trip duration → sq 8 138279
4 Stay duration → log 8 137329
Stay duration → sqrt 8 137573
Stay duration → sq 8 137690
5 Arr/Dep Time → cos, sin K = 1 10 134612
Arr/Dep Time → cos, sin K = 2 14 133915
Arr/Dep Time → cos, sin K = 3 18 133559
Arr/Dep Time → cos, sin K = 4 22 133463
Arr/Dep Time → cos, sin K = 5 26 133451
Arr/Dep Time → cos, sin K = 6 30 133412
Arr/Dep Time → cos, sin K = 7 34 133339
Arr/Dep Time → cos, sin K = 8 38 133323
Arr/Dep Time → cos, sin K = 9 42 133324
Arr/Dep Time → cos, sin K = 10 46 133290
Arr/Dep Time → cos, sin K = 11 50 133296
Arr/Dep Time → cos, sin K = 12 54 133294

Table 3
Summary of results using alternative specific features only.

method Top-1 Top-5 Top-15 running time (min:s)

uniform 0.067 0.262 0.567 –


MNL 0.224 0.624 0.894 01:23
RF50 0.250 0.631 0.902 01:20
RF500 0.249 0.643 0.908 15:04

Since RF are based on hard partitioning rules, these transformations help RF by putting completely different situations (e.g., different currencies, long- and short-haul flights) on the same scale. They also help when out-of-sample choice sets lie outside the experience of the training data, for example if prices rise after a fuel price shock. Additionally, since RF considers each alternative independently, these transformations inject some session context.
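These session-relative attributes can be computed with a grouped transform; a sketch on the flattened frame of Section 4.2, with hypothetical column names:

def add_ratio_features(df):
    """Express price and durations relative to the session's best value."""
    g = df.groupby("session")
    return df.assign(
        ratioPrice=df["price"] / g["price"].transform("min"),
        ratioTripDuration=df["trip_duration_min"] / g["trip_duration_min"].transform("min"),
        ratioStayDuration=df["stay_duration_min"] / g["stay_duration_min"].transform("min"),
    )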
Regarding price, trip duration and stay duration, for MNL we tried the raw values as well as the common log, square-root and square transformations. Departure and arrival times were decomposed into a sum of sine and cosine functions, as suggested in (Ben-Akiva and Abou-Zeid, 2013), to ensure that every time between 0 and 24 has a unique utility value. Additionally, angles with different frequencies ($\{2k\pi\}_{k=1,\ldots,K}$) are used in order to improve the model fit with respect to using only one frequency. To find the truncation point $K$, we add up to 12 frequencies one by one on both departure and arrival times and evaluate each model in terms of the Akaike Information Criterion (AIC) to indirectly estimate the prediction accuracy (see, e.g., Hastie et al., 2009, Ch. 7). We observe that the model fit degrades after $K = 10$. Table 2 shows the performance of the different MNL models in terms of AIC, where we replaced these features by the transformed ones in a stepwise manner; the transformations that improve the model fit are kept for the next step. In Appendix A, Table A.6 shows the coefficients of the estimated model and the usual fitness metrics.
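A sketch of this cyclic expansion: each time of day t (in seconds, rescaled to [0, 1)) is replaced by the 2K features cos(2πkt) and sin(2πkt) for k = 1…K, matching the variable naming of Table A.6:

import numpy as np

def cyclic_time_features(seconds, K=10):
    """Expand a time of day into K sine/cosine harmonics (K = 10 chosen by AIC)."""
    t = np.asarray(seconds) / 86400.0  # rescale seconds since midnight to [0, 1)
    feats = {}
    for k in range(1, K + 1):
        feats[f"cos{2 * k}pi"] = np.cos(2 * np.pi * k * t)  # e.g., cos2pi, cos4pi, ...
        feats[f"sin{2 * k}pi"] = np.sin(2 * np.pi * k * t)
    return feats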
We then compare, in terms of our metrics, the MNL model with the non-linear transformations to the RF model. Table 3 summarizes the results along with the running time for model fitting.⁴ For MNL, we show only the time to estimate the final model. We observe that RF achieves an improvement of 2.6 p.p. over MNL in Top-1 accuracy. Compared to RF50, RF500 achieves a slightly worse Top-1 accuracy (-0.04 p.p.) and slightly better Top-5 and Top-15 accuracies (+1.2 p.p. and +0.7 p.p., respectively) for an order of magnitude increase in running time. This experiment shows the ability of RF to automatically model non-linear relationships, whereas for MNL this requires more work and expert knowledge.
The feature importance analysis shown in Fig. 1 indicates that the most important feature is ratioPrice, which reflects price sensitivity. In addition, RF also gives high importance to ratioStayDuration, ratioTripDuration and the departure and arrival times, which suggests that the model also captures convenience preferences.

⁴ Both libraries used two threads on a standard personal computer (Intel(R) Core(TM) i5-5300U CPU @ 2.30 GHz with two cores and four logical processors).


Fig. 1. RF feature importance using alternative specific features only.

Table 4
LC-MNL improvement with non-linear transformations.

Transformation AIC

None 127035
DTD → log 127074
DTD → sqrt 127064

5.3. Experiment 2: individual heterogeneity

In this experiment, we exploit attributes that help to characterize the type of user, more precisely the purpose of the trip, i.e., business or leisure. As suggested in (Delahaye et al., 2017), the purchase anticipation (Days to Departure, DTD), the stay duration and the inclusion of a Saturday are good predictors of the purpose of the trip. Since these attributes can vary slightly within a choice set, we use their median value to characterize each individual. In addition, the Borenstein Travel Index (Borenstein, 2010) suggests that the Origin/Destination information can help in identifying the purpose of the trip.
Therefore, we consider a latent class MNL (LC-MNL) model with two classes. We used the attributes DTD, Stay Duration, Stay Saturday, Origin and Destination for the class membership model. The categorical variables Origin and Destination were dummy coded. For the alternative specific features, we used the same transformations as in Experiment 1. For DTD, we considered the log and sqrt transformations, but the raw attribute gave a better AIC, as shown in Table 4. In Appendix A, Tables A.7, A.8 and A.9 show the coefficients of the estimated model and the usual fitness metrics.
Table 5 shows the results in terms of prediction on the test set. As expected, LC-MNL yields much better results than MNL in terms of Top-N accuracy, thanks to its ability to represent individual heterogeneity. RF50 and RF500 yield comparable results for a much shorter running time.⁵
With respect to the number of classes, $q = 2$ was a natural choice from the business/leisure split point of view. Nevertheless, we also tried three, four and five classes. Three classes yielded a significantly worse fit (AIC = 130549) and a worse Top-1 accuracy (0.234). In fact, the increase in AIC is not only explained by the higher number of parameters but also by a worse likelihood, indicating a convergence problem of the optimization algorithm. Indeed, the LCCM library relies on the Expectation Maximization algorithm, which is prone to converging to local optima and even to saddle points (see, e.g., (McLachlan and Krishnan, 2007, Sec. 3.6)). Four and five classes yielded a slightly better fit (AIC = 126235 and 126496, respectively), but the prediction accuracy on the test set was in fact worse (0.264 and 0.262 Top-1 accuracy, respectively). In Table 5, we show the results obtained with two classes.
The feature importance analysis of RF shows that the combined importance of Origin and Destination is above all the other variables’

⁵ This experiment was run on an Intel(R) Xeon(R) CPU E5-2643 v4 @ 3.40 GHz with 24 cores.


Table 5
Summary of results using alternative and individual specific variables obtained from the raw features of Table 1. We include the result of MNL from Experiment 1 (with
only alternative specific variables) for reference purposes.

Method Top-1 Top-5 Top-15 running time (h:min:s)

uniform 0.067 0.262 0.567 –


MNL 0.224 0.624 0.894 –
LC-MNL 0.270 0.672 0.921 01:22:12
RF50 0.267 0.657 0.916 00:00:22
RF500 0.273 0.674 0.924 00:03:29

Fig. 2. RF feature importance using alternative specific features, Business/Leisure predictors and Origin/Destination.

importances, showing the heterogeneity of the geographic markets (see Fig. 2).
This experiment shows the ability of RF to capture segments in an automated way, in contrast to LC-MNL, which requires more effort and expert knowledge. Moreover, the better fit obtained with LC-MNL using a higher number of classes suggests that it could benefit from more segmentation if more data were available. Nevertheless, a whole set of parameters is added for each new class, which makes generalization difficult on the available dataset. Machine learning methods like RF are more flexible in finding a good trade-off between model complexity and generalization.

6. Conclusions and future work

In this paper we dealt with the air itinerary choice modeling problem, with a special focus on out-of-sample prediction applications. Most of the solutions found in the literature are based on discrete choice modeling, in particular MNL models and their variants. MNL-based models have many advantages, in particular their interpretability for behavior analysis. Nevertheless, they lack flexibility to handle collinear attributes and correlations between alternatives. Moreover, they require expert knowledge to introduce non-linearity in the effect of alternatives' attributes and to model individual heterogeneity.
We proposed an alternative approach based on machine learning (ML) algorithms that have been shown to be effective on prediction
tasks in many different fields. The main advantages of the proposed machine learning approach for this particular application are its
capability to automatically segment the travelers and to take into account non-linear relationships within attributes of alternatives and
characteristics of the decision maker.
We compared our machine learning approach (using Random Forests) to simple and latent class MNL models in two experiments: a) using alternative features only, and b) including attributes to model individual heterogeneity. The data used for the experiments was extracted from industrial travel processes and contains particularly challenging heterogeneity: it is composed of round-trip alternatives, multiple markets, different traveler profiles, and different distribution channels. The results showed that the ML method performed better in terms of prediction accuracy and computation time. The inclusion of individual heterogeneity attributes and the two-class latent model improved the performance of MNL; however, the computational cost increased significantly. In addition, MNL models require significantly more effort than ML to find good model specifications while keeping model complexity under control.
Regarding future work, we foresee two lines of research. On the theoretical side, it would be interesting to understand the consequences of violating the i.i.d. assumption; for example, we plan to investigate machine learning methods that consider each choice set as one sample point. On the applied side, we plan to leverage the availability of large-scale data platforms to explore even larger datasets collected by Global Distribution Systems and to take advantage of production-ready distributed implementations of machine learning algorithms.

Acknowledgements

The authors would like to thank their colleagues Maria A. Zuluaga, Alejandro Mottini and Eoin Thomas for providing valuable input
and discussion during this work.

Appendix A. Estimated MNL models


Table A.6
MNL model. Fitted Log-Likelihood: -66599.22. Null Log-Likelihood: -87984.44. Rho Squared: 0.243.

Variable Parameter StdError t-Stat

sqrtPrice 0.358 0.00595 60.1


logTripDuration 5.2 0.232 22.4
logStayDuration 1.28 0.0446 28.7
containsLCC1 0.652 0.0211 30.8
directTRUE 0.51 0.164 3.11
interlineTRUE 1.42 0.0466 30.5
outDepTimeCos2pi 52.4 10.9 4.82
outDepTimeSin2pi 24 5.05 4.76
outArrTimeSin2pi 48.8 36.8 1.33
outArrTimeCos2pi 54.7 35.5 1.54
outDepTimeCos4pi 33.4 7.01 4.77
outDepTimeSin4pi 39.4 8.2 4.81
outArrTimeSin4pi 64.4 45.2 1.42
outArrTimeCos4pi 8.74 4.29 2.04
outDepTimeCos6pi 11.5 2.79 4.12
outDepTimeSin6pi 41.5 8.59 4.83
outArrTimeSin6pi 44.6 24.7 1.81
outArrTimeCos6pi 29.4 27.7 1.06
outDepTimeCos8pi 5.05 1.86 2.71
outDepTimeSin8pi 32.9 6.75 4.87
outArrTimeSin8pi 10.9 5.46 2
outArrTimeCos8pi 39.1 27.2 1.44
outDepTimeCos10pi 12.8 2.9 4.42
outDepTimeSin10pi 19.8 4.04 4.89
outArrTimeSin10pi 12.2 14.7 0.83
outArrTimeCos10pi 25.4 11.7 2.17
outDepTimeCos12pi 12.7 2.65 4.78
outDepTimeSin12pi 8.17 1.82 4.5
outArrTimeSin12pi 16 10.8 1.49
outArrTimeCos12pi 7 3.57 1.96
outDepTimeCos14pi 8.37 1.69 4.95
outDepTimeSin14pi 1.28 0.732 1.75
outArrTimeSin14pi 9.13 3.44 2.65
outArrTimeCos14pi 2.88 4.86 0.592
outDepTimeCos16pi 3.94 0.787 5
outDepTimeSin16pi 0.959 0.456 2.1
outArrTimeSin16pi 2.28 1.19 1.92
outArrTimeCos16pi 3.76 2.34 1.61
outDepTimeCos18pi 1.23 0.264 4.64
outDepTimeSin18pi 0.968 0.248 3.9
outArrTimeSin18pi 0.288 0.751 0.384
outArrTimeCos18pi 1.6 0.495 3.23
outDepTimeCos20pi 0.232 0.0653 3.55
outDepTimeSin20pi 0.328 0.0792 4.14
outArrTimeSin20pi 0.363 0.158 2.31
outArrTimeCos20pi 0.252 0.122 2.06


Table A.7
LC-MNL: Class 2 membership. Fitted Log-Likelihood: -63379.69. Null Log-Likelihood: -87984.44. Rho Squared: 0.28.

Variable Parameter StdError t-Stat

intercept 4.5976 0.7231 6.3586


logDtd 0.3671 0.0446 8.2262
medStayDurationMinutes 0.0008 0.0000 27.2635
StaySaturday1 1.6085 0.1223 13.1497
Origin1 10.4793 651.0901 0.0161
Origin2 1.6054 0.5282 3.0392
Origin3 0.8783 0.5289 1.6605
Origin4 0.5994 0.6664 0.8995
Origin5 1.4750 0.5444 2.7095
Origin6 0.8898 0.6085 1.4622
Origin7 1.2099 0.5232 2.3126
Origin8 0.9132 0.5344 1.7089
Origin9 1.3479 0.5729 2.3529
Origin10 0.2530 0.6640 0.3810
Origin11 0.7471 0.7622 0.9802
Origin12 0.7358 0.5740 1.2818
Origin13 1.1952 0.5349 2.2346
Origin14 2.7984 0.6040 4.6335
Origin15 2.5744 0.5855 4.3970
Destination1 1.7080 1.1175 1.5285
Destination2 2.5375 0.4810 5.2752
Destination3 0.7174 0.7002 1.0247
Destination4 0.1248 0.5836 0.2138
Destination5 1.5491 0.8072 1.9191
Destination6 0.4966 0.5037 0.9859
Destination7 0.2122 0.4868 0.4359
Destination8 1.9274 1.1147 1.7291
Destination9 1.2219 0.8804 1.3879
Destination10 0.4443 0.4761 0.9332
Destination11 0.0144 4.3540 0.0033
Destination12 1.3746 0.5578 2.4645
Destination13 1.4798 0.8276 1.7881
Destination14 0.0601 0.5902 0.1019
Destination15 1.0445 0.6356 1.6434
Destination16 0.5341 0.4675 1.1423
Destination17 0.6458 0.5103 1.2656
Destination18 1.7808 0.6156 2.8927
Destination19 0.5686 0.5020 1.1326
Destination20 0.2157 0.4840 0.4457
Destination21 0.8453 0.6230 1.3568
Destination22 0.7230 0.6710 1.0774
Destination23 1.6551 0.5625 2.9425
Destination24 0.0707 0.5694 0.1242
Destination25 0.1883 0.4675 0.4028
Destination26 0.5216 0.5060 1.0307
Destination27 0.9920 0.4932 2.0114

Table A.8
LC-MNL: Class 1 model

Variable Parameter StdError t-Stat

logTotalTripDurationMinutes 5.3247 0.0447 119.0598


logStayDurationMinutes 1.1108 0.0141 79.0389
sqrtPrice 1.2825 0.0025 518.7190
containsLCC1 0.8868 0.0262 33.8291
directTRUE 0.0812 0.0850 0.9562
interlineTRUE 0.9669 0.0496 19.4767
outDepTimeCos2pi 25.8820 14.5159 1.7830
outDepTimeSin2pi 11.6728 6.5830 1.7732
outArrTimeCos2pi 20.6076 11.2189 1.8369
outArrTimeSin2pi 10.3110 11.6025 0.8887
outDepTimeCos4pi 16.2287 9.5200 1.7047
outDepTimeSin4pi 19.0014 10.7284 1.7711
outArrTimeCos4pi 14.4437 5.2677 2.7419
outArrTimeSin4pi 18.2048 14.0058 1.2998
outDepTimeCos6pi 5.8651 3.9863 1.4713
outDepTimeSin6pi 19.8665 11.3112 1.7564
outArrTimeCos6pi 2.6254 9.6134 0.2731



outArrTimeSin6pi 22.8180 8.8861 2.5678


outDepTimeCos8pi 2.1274 2.3435 0.9078
outDepTimeSin8pi 15.7036 8.9755 1.7496
outArrTimeCos8pi 10.0483 8.3929 1.1972
outArrTimeSin8pi 17.9706 6.6278 2.7114
outDepTimeCos10pi 5.7379 3.6617 1.5670
outDepTimeSin10pi 9.2304 5.4722 1.6868
outArrTimeCos10pi 15.2607 5.5308 2.7592
outArrTimeSin10pi 6.2129 5.8335 1.0650
outDepTimeCos12pi 5.4116 3.4052 1.5892
outDepTimeSin12pi 3.8417 2.5243 1.5219
outArrTimeCos12pi 11.0452 4.2424 2.6035
outArrTimeSin12pi 3.6800 3.4323 1.0722
outDepTimeCos14pi 3.4283 2.2077 1.5529
outDepTimeSin14pi 0.8818 1.0107 0.8725
outArrTimeCos14pi 3.4897 2.2329 1.5629
outArrTimeSin14pi 6.1181 2.3790 2.5717
outDepTimeCos16pi 1.6011 1.0421 1.5364
outDepTimeSin16pi 0.0832 0.5942 0.1401
outArrTimeCos16pi 0.8516 0.9032 0.9428
outArrTimeSin16pi 3.3444 1.3834 2.4176
outDepTimeCos18pi 0.5624 0.3529 1.5937
outDepTimeSin18pi 0.2364 0.3254 0.7264
outArrTimeCos18pi 1.0793 0.5176 2.0854
outArrTimeSin18pi 0.6451 0.4112 1.5690
outDepTimeCos20pi 0.2033 0.0865 2.3511
outDepTimeSin20pi 0.0896 0.1054 0.8496
outArrTimeCos20pi 0.2607 0.1406 1.8534
outArrTimeSin20pi 0.0551 0.1079 0.5102

Table A.9
LC-MNL: Class 2 model

Variable Parameter StdError t-Stat

logTotalTripDurationMinutes 8.5385 0.0571 149.4560


logStayDurationMinutes 0.6965 0.0159 43.6904
sqrtPrice 0.1040 0.0023 45.0124
containsLCC1 0.6035 0.0236 25.6196
directTRUE 2.0611 0.2913 7.0745
interlineTRUE 3.0718 0.1866 16.4649
outDepTimeCos2pi 319.6435 67.9504 4.7041
outDepTimeSin2pi 197.3289 41.4659 4.7588
outArrTimeCos2pi 116.2570 51.0394 2.2778
outArrTimeSin2pi 152.2132 61.5703 2.4722
outDepTimeCos4pi 148.2626 32.8547 4.5127
outDepTimeSin4pi 296.4367 62.3064 4.7577
outArrTimeCos4pi 42.4980 23.3625 1.8191
outArrTimeSin4pi 163.5791 67.8071 2.4124
outDepTimeCos6pi 23.7819 9.2469 2.5719
outDepTimeSin6pi 266.6403 56.4970 4.7195
outArrTimeCos6pi 125.4599 49.3810 2.5407
outArrTimeSin6pi 54.4129 33.1342 1.6422
outDepTimeCos8pi 116.8705 24.7212 4.7275
outDepTimeSin8pi 158.8084 34.5836 4.5920
outArrTimeCos8pi 89.2972 37.3439 2.3912
outArrTimeSin8pi 47.3412 25.4809 1.8579
outDepTimeCos10pi 120.8874 25.3560 4.7676
outDepTimeSin10pi 50.4835 13.1911 3.8271
outArrTimeCos10pi 12.7702 18.9386 0.6743
outArrTimeSin10pi 65.8821 24.7066 2.6666
outDepTimeCos12pi 76.6313 16.3116 4.6980
outDepTimeSin12pi 10.5949 5.3524 1.9795
outArrTimeCos12pi 24.8588 12.9601 1.9181
outArrTimeSin12pi 29.9126 14.0137 2.1345
outDepTimeCos14pi 31.1855 7.1017 4.3913
outDepTimeSin14pi 24.3904 5.5439 4.3995
outArrTimeCos14pi 19.3629 7.2001 2.6893
outArrTimeSin14pi 0.0555 7.1767 0.0077
outDepTimeCos16pi 6.5120 2.1241 3.0658
outDepTimeSin16pi 15.4930 3.3720 4.5946



outArrTimeCos16pi 4.9440 3.2737 1.5102


outArrTimeSin16pi 5.9479 3.1014 1.9178
outDepTimeCos18pi 0.5187 0.6286 0.8252
outDepTimeSin18pi 5.4713 1.2239 4.4703
outArrTimeCos18pi 0.3314 1.1695 0.2834
outArrTimeSin18pi 2.2054 1.0080 2.1879
outDepTimeCos20pi 0.6168 0.1947 3.1684
outDepTimeSin20pi 0.9543 0.2480 3.8478
outArrTimeCos20pi 0.2469 0.2136 1.1561
outArrTimeSin20pi 0.2563 0.2587 0.9909

References

Ben-Akiva, M., Abou-Zeid, M., 2013. Methodological issues in modelling time-of-travel preferences. Transportmetrica: Transport Science 9 (9), 846–859.
Bhadra, D., Hogan, B., 2005. US airline network: a framework of analysis and some preliminary results. In: 46th Annual Transportation Research Forum, Washington, D.C., March 6-8, 2005, 208186.
Borenstein, S., 2010. An Index of Inter-City Business Travel for Use in Domestic Airline Competition Analysis. UC Berkeley. Index available at: http://www.nber.org/data/bti.html. (Accessed 19 January 2018).
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Carrier, E., 2008. Modeling the Choice of an Airline Itinerary and Fare Product Using Booking and Seat Availability Data. Ph.D. thesis, Massachusetts Institute of Technology, Department of Civil and Environmental Engineering.
Caruana, R., Karampatziakis, N., Yessenalina, A., 2008. An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning. ACM, pp. 96–103.
Coldren, G.M., Koppelman, F.S., 2005a. Modeling the competition among air-travel itinerary shares: GEV model development. Transport. Res. Pol. Pract. 39 (4), 345–365.
Coldren, G.M., Koppelman, F.S., 2005b. Modeling the proximate covariance property of air travel itineraries along the time-of-day dimension. Transportation Research Record: Journal of the Transportation Research Board 1915, 112–123.
Coldren, G.M., Koppelman, F.S., Kasturirangan, K., Mukherjee, A., 2003. Modeling aggregate air-travel itinerary shares: logit model development at a major US airline. J. Air Transport. Manag. 9 (6), 361–369.
Delahaye, T., Acuna-Agost, R., Bondoux, N., Nguyen, A.-Q., Boudia, M., 2017. Data-driven models for itinerary preferences of air travelers and application for dynamic pricing optimization. J. Revenue Pricing Manag. 16 (6), 621–639.
El Zarwi, F., 2017. Modeling and Forecasting the Impact of Major Technological and Infrastructural Changes on Travel Demand. Ph.D. thesis, University of California, Berkeley.
Greene, W.H., Hensher, D.A., 2003. A latent class model for discrete choice analysis: contrasts with mixed logit. Transp. Res. Part B Methodol. 37 (8), 681–698.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction, second ed. Springer. URL: http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
Lurkin, V., 2017. Modeling in air transportation: cargo loading and itinerary choice. 4OR 15, 107–108.
Lurkin, V., Garrow, L.A., Higgins, M.J., Newman, J.P., Schyns, M., 2017. Accounting for price endogeneity in airline itinerary choice models: an application to continental US markets. Transport. Res. Pol. Pract. 100, 228–246.
McFadden, D., 1973. Conditional logit analysis of qualitative choice behaviour. In: Zarembka, P. (Ed.), Frontiers in Econometrics. Academic Press, New York, NY, USA, pp. 105–142.
McLachlan, G., Krishnan, T., 2007. The EM Algorithm and Extensions, vol. 382. John Wiley & Sons.
Newman, J.P., Lurkin, V., Garrow, L.A., 2016. Computational methods for estimating multinomial, nested, and cross-nested logit models that account for semi-aggregate data. In: 96th Annual Meeting of the Transportation Research Board, Washington, DC.
Parker, R.A., Lonsdale, R., Ervin, F., Zhang, Z., 2005. The Boeing global market allocation system. In: AGIFORS Symposium.
Proussaloglou, K., Koppelman, F.S., 1999. The choice of air carrier, flight, and fare class. J. Air Transport. Manag. 5 (4), 193–201.
