

A Comparative Study of Machine Learning Techniques for Aviation Applications

Conference Paper · June 2018


DOI: 10.2514/6.2018-3980


All content following this page was uploaded by Apoorv Maheshwari on 05 July 2018.



AIAA AVIATION Forum 10.2514/6.2018-3980
June 25-29, 2018, Atlanta, Georgia
2018 Aviation Technology, Integration, and Operations Conference

A Comparative Study of Machine Learning Techniques for Aviation Applications

Apoorv Maheshwari∗, Navindran Davendralingam†, and Daniel A. DeLaurentis‡


Purdue University, West Lafayette, IN, 47907

Machine learning is becoming a very popular way to find patterns in complex data. With
advancements in storage and computational capabilities, many machine learning techniques
are becoming suitable for real-world applications. In the authors' canvassing of the literature,
the adoption of machine learning techniques in the aviation community is low compared to
other communities, predominantly due to the lack of access to high-quality data and a high
reliance on simple, easily interpretable models rather than complex predictive models. In
addition, the taxonomy differences between the computer science and aviation communities
Downloaded by PURDUE UNIVERSITY on July 5, 2018 | http://arc.aiaa.org | DOI: 10.2514/6.2018-3980

also make adoption difficult. In this paper, we perform a comparative study of popular
supervised machine learning techniques for aviation problems using an air travel demand
modeling problem as an example. We implement Classification and Regression Trees, Support
Vector Machines, Neural Networks, and Ensemble Methods on the air travel demand estimation
and forecasting problems. With this work, we aim to provide a qualitative comparison of these
techniques to serve as a guideline for choosing a suitable algorithm for a given problem.

I. Introduction
Machine learning is a field of computer science that focuses on algorithms that can learn from and perform predictive
analysis on data. The idea of using data to create a model is not new and has always been the basis of scientific
research, especially experimental research. Storage and computational capacity have developed rapidly in
recent years, making many techniques that were proposed in the early 19th century (or even earlier) suitable for
real-world applications. With recent advancements in data processing frameworks, cloud services, and tools/libraries
that implement a variety of algorithms, the effort needed to apply a new technique to a given dataset has reduced
considerably. In fact, most of the commonly used machine learning tools are developed with a special focus on ease of
use across different communities. We have already started to experience machine learning in our daily routine in the
form of smart assistants in our mobile phones, spam filtering in our emails, computer vision technologies in our cars,
preventive health care through smart medication/appointment reminders, text-to-speech natural language processing, and
so on.
As machine learning gains prominence in all these areas, there is a need to start using this knowledge systematically
for aviation problems. One of the biggest benefits of machine learning techniques over conventional
statistical approaches is that the model form is mostly driven by the dataset rather than by initial assumptions placed
upon the form of the model. For example, in conventional statistical approaches, the data is usually fit to an assumed
form of a mathematical model, which naturally restricts the effectiveness of the model because the dynamics that
generate the data are unknown. In machine learning, by contrast, fewer restrictions are placed on the model form, which
allows for more flexibility in the modeling. The flip side is that these machine learning techniques typically require
a huge dataset and substantial computational capability to create the model. Moreover, the better performance of these
techniques comes at the price of increased complexity, which makes the models more difficult to interpret than those
from conventional statistical approaches.
Researchers are gradually beginning to apply machine learning techniques to aviation problems, but adoption
is slow due to the unavailability of high-quality data, which is typically held privately by large industrial
organizations, and a high reliance on expert opinion rather than data-driven complex models. Smith et al. [1] discuss
different technology forecasting techniques for complex systems and identify machine learning as useful for providing
estimates for future technology predictions. Lee et al. [2] used data provided by American Airlines to compare various
machine learning techniques for predicting taxi-out time at Charlotte Airport. They found that the machine learning algorithms
∗ Graduate Research Assistant, Aeronautics and Astronautics, 701 W. Stadium Ave., AIAA Student Member
† Research Scientist, Aeronautics and Astronautics, 701 W. Stadium Ave., AIAA Member
‡ Professor, Aeronautics and Astronautics, 701 W. Stadium Ave., AIAA Associate Fellow

Copyright © 2018 by Apoorv Maheshwari; Navindran Davendralingam; Daniel DeLaurentis.


Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.
are able to predict taxi time for 65–74% of departures within a 5-minute window and that the accuracy can be improved
with a more detailed dataset. Ukai et al. [3] use machine learning, specifically neural network techniques, to create
an aircraft deployment prediction model. They found that these models provide good accuracy and robustness for this
application but lacked the real-world data to validate the results. Kotegawa et al. [4] compare various machine learning
approaches to forecast restructuring of the US air transportation network and found the performance of artificial neural
networks to be significantly better than that of existing forecasting models. Text mining approaches are also gaining
popularity in analyzing accident reports to improve our understanding of safety-related incidents [5–8].
In this paper, we will focus on modeling air travel demand for a representative city pair. Demand modeling
can be classified into two types of problems: Estimation and Forecasting. Estimation models aim to quantify the
links between the level of demand and the variables that determine it. Forecasting models, on the other hand, aim to
predict the future level of demand based upon the time-ordered sequence of historical observations of a variable. We
will consider both types of problems in our analysis.
Since the problem of estimating air travel demand is intimately related to policy and profitability, there have been
considerable quantitative modeling efforts in this domain. One of the earliest attempts in the literature was made by
Taneja [9], where measures of socio-economic characteristics of airline passengers and transport-related features were
employed to model and forecast total transatlantic air travel demand. A more recent attempt using regression tools was
made by Bafail et al. [10]; this reference provides insights into the process of selecting explanatory variables and the
potential problem that arises when there is multi-collinearity between predictors. These studies serve as good
starting points for us to select appropriate features for our machine learning study. In the case study, we model the air
travel demand between any two cities, based on socio-technical factors, using machine learning techniques. To
reduce the complexity of the example problem, we will restrict our analysis to the top 30 airports (enplanement-wise) of
the US domestic air transportation network. We will also restrict our feature space to publicly available data in this
analysis so that policy makers, who might not have access to the proprietary information of various service providers
(such as airlines), can also make use of these models for future planning.
The overarching research objective of our work is to compare various machine learning techniques for aviation
problems with the help of an air travel demand modeling problem as an example.
In section II, we give a quick introduction to the different categories of machine learning approaches that are discussed
in this paper. In section III, we describe the problem setup and the data used for the analysis. In section IV, the
implementation of different machine learning algorithms is discussed along with a comparison of the algorithms, and
section V provides a summary of our work and future vision.

II. Brief Introduction to Relevant Machine Learning Techniques


Machine learning techniques can broadly be classified into the following three categories, based upon the nature of
the learning [11].
• Supervised Learning: In this category, the machine is presented with inputs and their desired outputs by a
teacher. The goal of the algorithms in this category is to create a mapping from inputs to outputs.
• Unsupervised Learning: In this category, no desired outputs are provided. The goal of the algorithms is to find
structure or pattern in the given inputs.
• Reinforcement Learning: In this category, the machine is provided only with a set of rules to interact with the
environment, along with feedback based upon its interaction. In algorithms from this category, the machine is
typically tasked with finding a strategy.
In this paper, we will restrict our study to the supervised learning algorithms and will implement those on the
demand modeling problem.

A. Classification of Supervised Learning Techniques


Based upon the type of the desired output, the supervised learning techniques can be further classified into two
major categories:
• Regression: In regression, the desired output variable is continuous in nature. This is typically used to get a point
estimate of the output (or target) variable for given input values. For the demand modeling problems, a regression
algorithm will provide an estimate of the air travel demand between two cities based upon the given inputs.
• Classification: In classification, the desired output is the category of a given input, based upon the classes
defined using the previously provided labeled data. For example, in the case of demand modeling, the problem can
be formulated with the objective of estimating the level of demand (such as very low, low, medium, high, very high)
instead of getting an exact numerical value.

B. Process of Supervised Machine Learning


The process of applying supervised machine learning to a real-world problem is shown in Figure 1. The first
step is collecting the relevant dataset. Usually, a subject matter expert identifies the features relevant to the problem.
The next step in the process is to clean the dataset to remove noise and handle missing data [12–14]. Another important
strategy at this stage is to identify a set of orthogonal features. This not only minimizes the number of redundant
features in the set but also enables the algorithms to run faster due to the reduced dimensionality of the dataset [15, 16].
The selection of which algorithm to use is a crucial step. Several suitable algorithms are usually employed, tuned,
and evaluated to identify the best performing algorithm for the problem. A common method to evaluate an algorithm is
to divide the dataset into two sets, namely, training data (typically ∼80%) and testing data (typically ∼20%). The
training data is used to train and build the model, and the model is evaluated by its performance on the testing data.
Another popular evaluation technique, known as k-fold cross validation, is used when we have a dataset of sufficient
size. In this approach, the original dataset is randomly partitioned into k equally sized subsamples. Of these k
subsamples, k − 1 are used to train a model and the remaining one is used to test it. This process is repeated until
every subsample has been used once as the test data. The average performance of an algorithm, along with its variation,
is then reported and used for comparison with other algorithms.
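The k-fold procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the implementation used in this study; the function names and the `fit_and_score` callback are hypothetical placeholders for whatever model and scoring rule are being evaluated.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition range(n_samples) into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index, starting at position j.
    return [indices[j::k] for j in range(k)]

def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Return one test score per fold.

    fit_and_score(train_idx, test_idx) is a hypothetical user-supplied
    callable: it trains on train_idx and returns a score on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for j, test_idx in enumerate(folds):
        # Train on the k - 1 folds that are not the current test fold.
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return scores

def mean_and_std(scores):
    """Average performance and its variation across the folds."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance ** 0.5
```

Each sample appears in exactly one test fold, so every observation is used for testing once and for training k − 1 times.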
As can be seen in Figure 1, if the performance of the model is not satisfactory, we must return to previous stages
to examine what might be going wrong. A number of issues are possible, including (but not limited to) features not
being identified properly, data not being cleaned appropriately, an unsuitable algorithm being implemented, algorithm
parameters not being tuned, and so on.
Supervised machine learning is one of the most popular areas of machine learning and thus, a large number of
techniques have been developed in this area. In the next subsection, we focus on the most relevant machine learning
techniques, coming from diverse theoretical backgrounds, that we will implement on the demand modeling problem.

Fig. 1 The process of Supervised Machine Learning (adapted from [17])

C. Relevant Machine Learning Techniques

1. Classification and Regression Trees (CART)


These approaches, first introduced by Breiman et al. [18], produce a decision tree to predict the output value
based upon the given inputs. Based upon the type of the
output variable, the trees are classified into classification trees (for a discrete output variable) and regression
trees (for a continuous output variable). In a decision tree, each interior node represents an input variable, and edges
(or branches) are created based upon different possible values of the input variable. Each leaf corresponds to a value
of the target variable based upon the values of the input variables represented by the path from root to leaf. An
example of a classification decision tree is shown in Figure 2, describing the survival of passengers on the Titanic.

A typical decision tree algorithm works top-down and chooses, at each step, the variable that best splits the
relevant subset of the data. Different algorithms use different metrics to quantify the quality of a split. Some of the
most commonly used metrics are Gini impurity [18] and information gain [19].
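To make the two split metrics concrete, the sketch below computes Gini impurity and entropy-based information gain for lists of class labels. These are the standard textbook definitions; the function names are illustrative and are not taken from any particular library.

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """1 - sum(p_c^2) over classes c; 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; 0 for a pure node."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted
```

A tree-building algorithm would evaluate candidate splits with one of these functions and keep the split with the lowest child impurity (or, equivalently, the highest gain).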
Decision trees are preferred in the machine learning community as they often provide a fast and easy-to-understand
representation of the data. Decision tree models are very useful when modeling human decisions and behavior.

Fig. 2 Decision tree for survival on Titanic (Values under the leaves show the probability of survival and the
percentage of observations in the leaf)

2. Support Vector Machines (SVM)


A support vector machine [20] aims at finding the classifier (represented as a hyperplane or a set of hyperplanes)
that best separates the given dataset. Consider a classification problem where the output variable has two possible
values (or categories): +1 and −1. Suppose the training data is linearly separable, i.e., a pair (w, b) exists such that

\[
\begin{aligned}
w^T x_i + b &\geq 1 && \forall\, y_i = 1 \\
w^T x_i + b &\leq -1 && \forall\, y_i = -1
\end{aligned}
\]

In this case, the decision rule is \(f_{w,b}(x) = \operatorname{sgn}(w^T x + b)\), where w is the weight vector and b is
the bias. In other words, for linearly separable data, we can find two parallel hyperplanes that separate the two classes
of data such that the distance between them is maximized. We define the region between these two hyperplanes as the
margin and thus, the maximum-margin hyperplane lies at an equal distance from both of the previously identified hyperplanes.
Thus, identifying the maximum-margin hyperplane (when the data is linearly separable) can be formulated as an
optimization problem:

\[
\begin{aligned}
\text{Minimize} \quad & \|w\| \\
\text{subject to} \quad & y_i (w^T x_i + b) \geq 1, \quad \text{for } i = 1, \ldots, n
\end{aligned}
\]
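The link between maximizing the distance between the two bounding hyperplanes and minimizing the norm of w can be made explicit with a short derivation. This is the standard textbook argument rather than a result specific to this paper; x_+ and x_- are illustrative points lying on the hyperplanes w^T x + b = 1 and w^T x + b = −1, respectively.

```latex
\begin{aligned}
w^T x_+ + b = 1, \quad w^T x_- + b = -1
  &\;\Rightarrow\; w^T (x_+ - x_-) = 2 \\
\text{margin width} \;=\; \frac{w^T (x_+ - x_-)}{\|w\|}
  &\;=\; \frac{2}{\|w\|}
\end{aligned}
```

The margin width is the projection of x_+ − x_- onto the unit normal w/‖w‖, so maximizing 2/‖w‖ is equivalent to minimizing ‖w‖.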

When the data is not linearly separable, Veropoulos et al. [21] proposed the concept of a soft margin by introducing
non-negative slack variables \(\xi_i = \max(0, 1 - y_i (w^T x_i + b))\), \(i = 1, \ldots, n\), in the constraints. The
modified optimization problem is:

\[
\begin{aligned}
\text{Minimize} \quad & \lambda \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \text{for } i = 1, \ldots, n \\
& \xi_i \geq 0
\end{aligned}
\]

where λ determines the trade-off between keeping the training points on the correct side of the margin and the size of
the margin. In real-world applications, the dataset is rarely found to be linearly separable, and a popular approach is
to map the data to a higher-dimensional space (computationally enabled by kernel functions [22]) and define the
separating hyperplane there.
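As a small numerical illustration of the soft-margin formulation, the sketch below evaluates the slack variables and the objective for a given candidate (w, b). It assumes the objective has the form λ‖w‖² plus the mean slack, with each slack equal to max(0, 1 − y_i(wᵀx_i + b)); the function name and the toy data are hypothetical, and no optimization is performed.

```python
def soft_margin_objective(w, b, X, y, lam):
    """Return (objective, slacks) for a candidate hyperplane (w, b).

    Assumed objective: lam * ||w||^2 + (1/n) * sum(slack_i), with
    slack_i = max(0, 1 - y_i * (w . x_i + b)).
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    objective = lam * dot(w, w) + sum(slacks) / len(slacks)
    return objective, slacks

# Toy data: the third point lies inside the margin, so its slack is positive.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, y, lam=0.1)
```

Points that satisfy the hard-margin constraint contribute zero slack, so only margin violations are penalized.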
Crammer and Singer [23] proposed a variation of this binary classification problem to extend it to multiclass problems.
Similarly, Drucker et al. [24] provided a formulation to extend the classification problem to regression problems. The
regression approach is commonly referred to as Support Vector Regression (SVR).
Once the separating hyperplane is identified, most of the data other than the points closest to the margin (known as
the support vectors) becomes redundant. Thus, SVM provides a great way to generalize the given dataset and is usually
robust against small changes in the input dataset. Due to these advantages, SVM is very popular in text categorization
and in recognizing hand-written characters.

3. Neural Networks
Neural networks (or deep learning, as it is more recently called) are a technique inspired by how neurons work in
biological systems (i.e., the brain). The idea is to form a mathematical representation that is similar to how
interconnected neurons in biological systems are wired to fire and produce an output, given some form of stimulus
signal. These mathematical functions are expressed as nodes and are interconnected to form a complex web of inputs and
outputs among and between said nodes; the sequence, the types of functions used, and the connectivity between the neural
network's nodes dictate its effectiveness in dealing with various forms of machine learning problems. Figure 3 below
shows a notional neural network setup where the input nodes denote the input variables, the hidden nodes represent
intermediate values calculated from the weighted inputs and the selected activation functions of the chosen type of
neural network setup, and, finally, the output nodes give the predicted outcome values. The neural network learns a set
of weighting factors to improve the mapping of input data to some desired (observed) output data.

Fig. 3 Notional neural network architecture
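The flow from input nodes through hidden nodes to output nodes can be sketched as a single forward pass. This is a minimal illustration with one hidden layer; the tanh activation, the identity output, and all names are assumptions chosen for brevity, and the learning of the weights is not shown.

```python
from math import tanh

def dense(inputs, weights, biases, activation):
    """One layer: each node applies the activation to its weighted inputs plus a bias."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input nodes -> hidden nodes (tanh) -> output nodes (identity)."""
    hidden = dense(x, hidden_w, hidden_b, tanh)
    return dense(hidden, out_w, out_b, lambda z: z)

# Two inputs, one hidden node, one output node.
prediction = forward([0.5, 9.0], [[1.0, 0.0]], [0.0], [[2.0]], [0.0])
```

Training would adjust `hidden_w`, `hidden_b`, `out_w`, and `out_b` so that the outputs match the observed data more closely.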

While neural networks have been around since the 1940s, their use and popularity diminished until recently. The
resurgence in the use, research, and development of neural networks, and of machine learning in general, has been fueled
by a range of factors that include advances in nonlinear optimization techniques, computational hardware and software
improvements, and voluminous data, to name a few. These, among other factors, have led to increased application of
neural networks in complex applications, resulting in increased accuracy [25].

4. Ensemble Methods
Ensemble methods are meta-algorithms that extend machine learning techniques to include consideration for
weighting different possible models used in predictions. In a sense, the predicted outcome can be generated from a
weighted (or otherwise aggregated) set of outcomes from multiple models, each of which uses the input data either wholly
or in part. Examples of ensemble-based methods include bootstrapping [18], bagging (also known
as bootstrap-aggregating), and stacking [26]. Figure 4 highlights an example architecture for an ensemble method where
a set of data is used by multiple models to generate multiple individual predictions. In this case, we take the example
of decision trees (e.g., Classification and Regression Trees as explained earlier), where different tree structures may
be formed and different subsets of the data may be used. Here, the use of a bagging ensemble approach such as
Random Forests to generate different tree structures and, consequently, different predictions generally results in good
prediction performance when the individual model predictions are aggregated. The intuition behind ensemble method
approaches is that typical model errors come from a trade-off between model bias and variance. In the case of ensemble
methods, the idea is to aggregate away sources of these errors across the models used, as these sources of error can only
be traded off within any individual model.

Fig. 4 Notional ensemble method architecture
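A minimal sketch of bagging for regression is given below: each base model is fit to a bootstrap resample of the data and the individual predictions are averaged. The nearest-neighbour base learner is an assumption chosen only because it is short and high-variance; it stands in for the decision trees discussed above and is not the Random Forest algorithm itself.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def fit_nearest_neighbour(sample):
    """Toy base learner (illustrative assumption): predict the target of the
    training point whose feature value is closest to x."""
    def predict(x):
        return min(sample, key=lambda point: abs(point[0] - x))[1]
    return predict

def bagged_predict(data, x, n_models=25, seed=0):
    """Average the predictions of n_models base learners, each fit on its own resample."""
    rng = random.Random(seed)
    models = [fit_nearest_neighbour(bootstrap_sample(data, rng)) for _ in range(n_models)]
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)
```

Because each resample omits some points and repeats others, the base models disagree slightly, and averaging them smooths out part of that variance.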

III. Problem Description


We use the demand modeling problem to compare the aforementioned algorithms. The demand modeling problem
is one of the most studied problems in the aviation community. Moreover, as predictive modeling is one of the most
promising aspects of machine learning, this example will provide a good basis for future use of machine learning
in aviation problems. As mentioned earlier, in this paper, we will restrict our analysis to the top 30 airports of the US
domestic air transportation network. Moreover, we will only consider the use of publicly available data to prepare
models suitable for use in the absence of proprietary information. It is important to note that the formulation of the
machine learning models can easily be extended to incorporate proprietary information.
Following the steps shown in Figure 1, we start the process by collecting the required data. As mentioned before,
we restrict our feature space to publicly available data. The following is a list of all the selected features along
with the respective data sources.
• Demand Data: Market demand data for all the routes between the 30 airports has been collected from the [27]
Database for the years 2005-2014. The raw data has been processed and compiled in the required format for the
analysis.
• Distance: The distance information between any two airports has been calculated from both the BTS data and the
latitude-longitude data; for the computation, the geosphere R package was used, and distances were verified with
Google maps.
• Population: The population data for each city where the airports are located has been collected from [28].
• Economy Metric: Per capita income of the cities is collected from Department of Numbers [29] and National
GDP is collected from Bureau of Economic Analysis [30] for the years 2005-2014.
Next, we applied MinMax scaling to all the parameters, i.e., mapped each continuous input to [0,1] such that 0 and 1
correspond to the minimum and the maximum, respectively. We also added a filter to remove data points with an
annual demand of less than 10,000 before scaling the data. The filter was applied to get rid of some of the outliers.
Since we also plan to assess the classification techniques, we converted our continuous demand data into five levels of
demand by taking the logarithm of the unscaled data and dividing it into five equally spaced categories. These five
levels are used as the target variable for the classification algorithms. The regression approach corresponds to the
demand estimation problem whereas the classification approach corresponds to the demand forecasting problem, where we are
trying to predict the level of the demand given the predictor variables.
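The preprocessing steps described above (filtering, MinMax scaling, and log-binning into five levels) can be sketched as follows. This is an illustrative reading of the text and not the code used in the study; the base-10 logarithm, the function names, and the example demand values are assumptions.

```python
from math import log10

def min_max_scale(values):
    """Map values linearly to [0, 1]; assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def demand_levels(demands, n_levels=5, min_demand=10_000):
    """Drop demands below min_demand, then bin log-demand into equally spaced levels.

    Returns (kept_demands, levels) with levels in 0 .. n_levels - 1.
    Assumes the kept demands are not all identical.
    """
    kept = [d for d in demands if d >= min_demand]
    logs = [log10(d) for d in kept]  # base 10 is an assumption
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / n_levels
    # The maximum falls exactly on the upper edge, so clamp it into the top level.
    levels = [min(int((v - lo) / width), n_levels - 1) for v in logs]
    return kept, levels

# Hypothetical annual demands; the first is removed by the filter.
kept, levels = demand_levels([5_000, 10_000, 100_000, 1_000_000])
scaled = min_max_scale([10, 20, 30])
```

Binning on the logarithm keeps the few very high-demand routes from occupying most of the range and leaving the lower levels crowded.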

IV. Results and Comparison of the Algorithms
In this section, we describe the results that were obtained using different algorithms on the same dataset. It is
important to keep in mind that the aim of this paper is not to create the best demand model but to describe the process of
tuning the parameters to get a good model for these algorithms and more importantly, compare algorithms based upon
their ease of use and other properties. All the models are implemented using standard libraries from one of the most
popular Machine Learning tools, scikit-learn [31].

A. CART Results
We implemented CART using the standard DecisionTreeRegressor and DecisionTreeClassifier classes. We
used 10-fold cross validation to assess the performance of the models. For the classification problem, the mean accuracy
on the test data was computed, whereas for regression, the mean R² value on the test data was used to assess the performance.
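For reference, the two scoring measures used throughout this section can be written out directly. The sketch below uses the standard definitions of the coefficient of determination and classification accuracy; the function names are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

An R² of 1 indicates a perfect fit, while 0 corresponds to a model that does no better than always predicting the mean of the targets.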
There are various hyperparameters that need to be tuned for a decision tree, such as the maximum depth of the
tree, the maximum number of features to be used, the minimum number of samples required to split a node, the minimum
number of sample points in a leaf, and the splitting metric. Due to the large number of possible combinations, we use
RandomizedSearchCV, where a randomized search is performed over a set of parameters by sampling each setting from a
distribution over possible parameter values. This method is preferred over an exhaustive search when the design space is
computationally too expensive to explore. In Figure 5, we show the model performance with respect to the maximum
depth of a tree. We can observe that after a specific maximum depth, the training accuracy becomes 1.0, implying that
the model completely fits the training data, but the testing accuracy settles at a lower value. In both cases, we observe
the maximum performance with a maximum depth of 16, and thus, we set the hyperparameter value to 16, as increasing it
further doesn't make any difference to the model performance.
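The idea behind a randomized hyperparameter search can be sketched without any library: draw a fixed number of settings from per-parameter distributions and keep the best-scoring one. This is a simplified stand-in for illustration, not scikit-learn's RandomizedSearchCV; `score` is a hypothetical callable that would normally run cross validation, and the toy score below is invented purely so the example runs.

```python
import random

def randomized_search(param_distributions, score, n_iter=20, seed=0):
    """Sample n_iter settings and return (best_params, best_score).

    param_distributions maps each name to a function rng -> sampled value.
    score(params) is a hypothetical user-supplied evaluation, e.g. a mean
    cross-validation score for a model built with those parameters.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sample(rng) for name, sample in param_distributions.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Illustrative distributions over two decision-tree hyperparameters.
distributions = {
    "max_depth": lambda rng: rng.randint(1, 30),
    "min_samples_leaf": lambda rng: rng.randint(1, 10),
}
# Toy score (made up for this example): it peaks at a maximum depth of 16.
toy_score = lambda p: -abs(p["max_depth"] - 16)
best_params, best_score = randomized_search(distributions, toy_score, n_iter=50)
```

The cost is fixed by `n_iter` regardless of how many hyperparameters are searched, which is what makes the approach attractive when an exhaustive grid is too expensive.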
It is important to note that the major benefit of a decision tree lies in easy representation of the data and therefore, it
is desirable to keep the tree depth low to keep the model human-readable.

(a) Tuning maximum depth for Regression (b) Tuning maximum depth for Classification

Fig. 5 Tuning hyperparameter, Maximum Depth for the Decision Tree

B. SVM Results
We implemented the SVM with a Radial Basis Function kernel for both the classification and regression problems. Again,
10-fold cross validation was used to assess the performance of the models, with mean R² and mean accuracy as the
scoring parameters for the regression and classification models, respectively. We tuned the cost (or penalty)
hyperparameter using GridSearchCV to obtain the optimal model. Figure 6 shows the variation in model performance with
the cost parameter. It is important to understand that the penalty parameter trades off misclassification/bias against
the simplicity/variance of the model, i.e., increasing the penalty value yields a more complex model with less bias,
whereas decreasing the penalty value decreases the variance of the model.
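For readers unfamiliar with the Radial Basis Function kernel, the sketch below evaluates its usual form, k(x, z) = exp(−γ‖x − z‖²). The function name and the default value of γ are illustrative assumptions; the kernel width used in this study is not stated in the text.

```python
from math import exp

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return exp(-gamma * squared_distance)

# Similarity is 1 for identical points and decays towards 0 with distance.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
apart = rbf_kernel([0.0], [1.0], gamma=2.0)
```

The kernel value acts as a similarity measure, which lets the SVM fit a nonlinear boundary in the original feature space without explicitly constructing the higher-dimensional mapping.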
In our case, the best regression model had a mean R² value of 0.408 with a standard deviation of 0.086, and the
best classification model had a mean accuracy of 0.616 with a standard deviation of 0.031. In general, SVM produces
models with good prediction accuracies, but it requires a large sample size to do so. In our sample problem, the dataset
is not large enough to achieve that, and this is why we observe the decision tree outperforming the SVM by a huge margin.

(a) Tuning penalty parameter for SVM Regression (b) Tuning penalty parameter for SVM Classification

Fig. 6 Tuning penalty parameter for Support Vector Machine

C. Neural Network Results


We used a multi-layered perceptron (MLP) to illustrate the use of one form of neural network, in this case, a
feed-forward neural network. MLPs can be adapted to perform both regression and classification tasks, and generally
utilize a backpropagation algorithm to determine relevant network weights. In our concept demand prediction model, we
consider both the regression and classification cases, as in the other approaches. An MLP network consisting of a
(200 x 200) network was used, and weights were tuned using the Limited Memory Broyden-Fletcher-Goldfarb-Shanno
(LBFGS) algorithm, a quasi-Newton optimization algorithm that uses a limited amount of computer memory, which is
a key feature in machine learning applications. Figure 7 shows the results of the MLP neural network tuning. The best
tuned point corresponds to a score of approximately R² = 0.459 at a regularization value of 0.01 for the case of
regression, and an accuracy of 0.624 at a regularization value of 0.0016 for the case of classification (expressed as
scores for the test case, with both implementations utilizing L2-regularization).

(a) Tuning Penalty parameter for Neural Network Regression (b) Tuning Penalty parameter for Neural Net Classification

Fig. 7 Tuning Penalty parameter for Neural Network

While neural networks have found great success in many complex problems ranging from pattern recognition to
stock predictions, in our sample case, the MLP does not perform very well, despite the large number of neural network
nodes, due to the very limited data used; this highlights a key point that while methods may be generally effective
in dealing with complex interactions, the caveats on data requirements can limit their full potential. In our case,
adding nodes generally improves the predictions, but at a large computational cost and with the risk of over-fitting
relatively small data sets to a large number of neural network features.

D. Ensemble Methods Results


As the decision tree produced good performance, we implemented the Random Forest algorithm for both the classification
and regression problems. A random forest fits a number of decision trees on sub-samples of the data and uses averaging
to improve the accuracy and reduce over-fitting. We used a maximum depth of 10 for the decision trees and observed the
performance with respect to the number of estimators being used. Figure 8 shows the results, and we can observe that
the performance of Random Forest is very stable and less susceptible to over-fitting, as per our expectations. We tested
the model with up to 100 estimators, but we can see that good performance can be achieved with as few as 15 estimators.
The best performing regression model had a score of 0.823 with a standard deviation of 0.09, whereas the best
classification model had a score of 0.829 with a standard deviation of 0.03.

(a) Tuning #Estimators for Random Forest Regression (b) Tuning #Estimators for Random Forest Classification

Fig. 8 Tuning Number of Estimators for Random Forest

E. Comparison of the Machine Learning Algorithms


In this section, we perform a qualitative comparison of the tested algorithms to guide practitioners in selecting an
appropriate machine learning algorithm. Discussion of all the pros and cons of each individual algorithm is beyond the
scope of this paper; as there is no perfect answer to the algorithm selection but we provide a comparison based upon the
general performance observed in the literature and this study.
Generally speaking, SVMs and Neural Networks achieve good accuracy on complex data sets, but they require a
large sample size to do so. As we observed in our study, both SVMs and Neural Networks performed
poorly owing to the small data set of about 3000 sample points. For the problem at hand, decision trees and random
forests worked well. Decision trees are quite prone to over-fitting as the maximum depth is increased, but an ensemble
of decision trees, such as a random forest, can handle this issue very well. Similarly, a neural network with a large
number of layers might over-fit. Moreover, it is quite difficult to control over-fitting in an NN due to its complex
architecture.
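The effect of maximum depth on over-fitting can be illustrated by comparing training and test accuracy for a single tree and for a forest. The sketch below uses synthetic, noisy data as a stand-in for the study's data set; the 70/30 split, the depth values, and the 50 estimators are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with 10% label noise (assumption: not the
# study's data), sized to roughly 3000 samples as in the text.
X, y = make_classification(n_samples=3000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

def train_and_test_accuracy(model):
    """Fit on the training split and return (train accuracy, test accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

scores = {}
for depth in (3, 10, None):  # None lets each tree grow until its leaves are pure
    scores["tree", depth] = train_and_test_accuracy(
        DecisionTreeClassifier(max_depth=depth, random_state=0))
    scores["forest", depth] = train_and_test_accuracy(
        RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0))

for (name, depth), (train_acc, test_acc) in scores.items():
    print(f"{name:6s} max_depth={depth}: train={train_acc:.3f} test={test_acc:.3f}")
```

A widening gap between training and test accuracy as the depth grows is the usual symptom of over-fitting; comparing the test columns for the tree and the forest shows how much the ensemble recovers on a given data set.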
We also observed that SVM and NN take a long time to learn compared to decision trees and the corresponding
ensemble methods, although once the models are trained, all the discussed algorithms can perform prediction very
quickly. Another factor that adds to the 'ease-of-use' of an algorithm is the number of hyperparameters that the user
needs to tune. The NN and SVM are by far the most complex algorithms and require tuning of a large number of
hyperparameters. In the interest of time, we restricted ourselves to the penalty hyperparameter only (for SVM and NN)
in our study.
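Learning and prediction times can be measured separately with a small timing harness. The following is a sketch under stated assumptions: the data are synthetic, and the model settings are illustrative rather than those of the study. In scikit-learn the penalty terms are exposed as C for SVC and alpha for MLPClassifier; whether these match the study's exact setup is an assumption.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data of roughly the size reported in the text (assumption).
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

# Illustrative settings only; C and alpha are the scikit-learn penalty terms.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=15, max_depth=10,
                                            random_state=0),
    "svm": SVC(C=1.0),
    "neural_network": MLPClassifier(alpha=1e-4, max_iter=200, random_state=0),
}

def time_fit_and_predict(model, X, y):
    """Return (seconds spent in fit, seconds spent in predict)."""
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds = time.perf_counter() - start
    start = time.perf_counter()
    model.predict(X)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds

timings = {name: time_fit_and_predict(model, X, y)
           for name, model in models.items()}
for name, (fit_s, predict_s) in timings.items():
    print(f"{name:15s} fit={fit_s:.3f}s predict={predict_s:.3f}s")
```

Absolute timings depend on hardware and library version, so only the relative ordering on a given machine is meaningful.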
There is general agreement that NN is very sensitive to irrelevant features, due to the construction of the algorithm.
This can lead to a very inefficient training process. All the other discussed algorithms handle irrelevant features fairly
well. In our study, we observed a similar trend, where the predictor variable gdp was the least effective variable.
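One way to identify the least effective predictor is to rank the impurity-based feature importances of a fitted random forest. This is a sketch of that general technique, not necessarily the procedure used in the study: the data are synthetic, and the predictor names are hypothetical placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples = 3000

# Hypothetical predictor names on synthetic data; they are placeholders,
# not the variables or values used in the study.
feature_names = ["population", "income", "distance", "irrelevant"]
X = rng.normal(size=(n_samples, len(feature_names)))
# The target depends on the first three columns only; "irrelevant" is pure noise.
y = (3.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2]
     + rng.normal(scale=0.1, size=n_samples))

model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=0)
model.fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name:12s} {importance:.3f}")
```

The predictor at the bottom of the ranking is a candidate for removal before training a model that is sensitive to irrelevant inputs, such as a neural network.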
Another important characteristic of any machine learning model is the ability to understand the fitted model.
Decision trees and Neural Networks lie at opposite ends of the spectrum with regard to the transparency of the fitted
model. In a decision tree, one can inspect each node where a decision is made, whereas in a neural network
the fitted model is a black box in which it is very difficult to interpret the impact of individual predictor variables.
SVMs paired with complex kernel functions present a similar issue.
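The transparency of a decision tree can be seen by printing its fitted rules. The sketch below uses scikit-learn's export_text on a shallow tree trained on synthetic data; the feature names are hypothetical and the depth of 2 is chosen only to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic data set with hypothetical feature names (assumptions).
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each line of the report is a threshold test on one predictor or a leaf class.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

No comparable printout exists for the weights of a neural network, which is the contrast drawn in the paragraph above.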
There is no single algorithm that uniformly outperforms the others over all data sets. Therefore, the
qualitative comparison summarized in Table 1 is meant to be a guideline for practitioners to start with an
appropriate algorithm for the problem at hand. The Expected Performance is compiled from existing empirical and
theoretical studies, whereas the Observed Performance is specific to the problem discussed in this paper.

Table 1 Comparison summary (*** denotes the best and * the worst performance), adapted from [17]

                                                   Decision          Neural     Ensemble
Criterion                                          Trees      SVM    Networks   Methods
Accuracy in general                     Expected   **         ***    ***        ***
                                        Observed   ***        **     **         ***
Speed of learning                       Expected   ***        *      *          **
                                        Observed   ***        *      *          ***
Speed of classification/prediction      Expected   ***        ***    ***        ***
                                        Observed   ***        ***    ***        ***
Tolerance to irrelevant attributes      Expected   ***        ***    *          ***
                                        Observed   ***        ***    **         ***
Overfitting issues                      Expected   **         **     *          **
                                        Observed   ***        **     **         ***
Readability/Transparency of knowledge   Expected   ***        *      *          **
                                        Observed   ***        *      *          **
Ease of use - Tuning hyperparameters    Expected   ***        *      *          **
                                        Observed   **         *      *          ***
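As an illustration of how Table 1 might be used as a starting guideline, the Expected ratings can be encoded and combined with user-chosen weights. The weighted-sum scoring below is our own illustrative device, not a method proposed in the paper or in [17], and the short criterion keys and the example weights are assumptions.

```python
# Expected star ratings transcribed from Table 1 (3 = best, 1 = worst).
# Column order: Decision Trees, SVM, Neural Networks, Ensemble Methods.
ALGORITHMS = ("Decision Trees", "SVM", "Neural Networks", "Ensemble Methods")
EXPECTED = {
    "accuracy": (2, 3, 3, 3),
    "speed_of_learning": (3, 1, 1, 2),
    "speed_of_prediction": (3, 3, 3, 3),
    "tolerance_to_irrelevant_attributes": (3, 3, 1, 3),
    "overfitting": (2, 2, 1, 2),
    "transparency": (3, 1, 1, 2),
    "ease_of_use": (3, 1, 1, 2),
}

def rank_algorithms(weights, ratings=EXPECTED):
    """Rank algorithms by the weighted sum of their star ratings.

    `weights` maps a criterion key to its (hypothetical) importance for the
    problem at hand; criteria that are not listed are ignored.
    """
    totals = {name: 0.0 for name in ALGORITHMS}
    for criterion, weight in weights.items():
        for name, stars in zip(ALGORITHMS, ratings[criterion]):
            totals[name] += weight * stars
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Example: accuracy matters twice as much as transparency.
for name, total in rank_algorithms({"accuracy": 2, "transparency": 1}):
    print(f"{name:17s} {total:.1f}")
```

Such a score only suggests where to start; as noted above, no single algorithm outperforms the others on all data sets.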

V. Conclusion
In this paper, we provided a brief introduction to a diverse set of supervised machine learning techniques and
demonstrated their application to aviation problems by applying them to a representative air travel demand modeling
problem. We performed a qualitative comparison of different machine learning techniques, namely Classification and
Regression Trees (CART), Support Vector Machines (SVM), Neural Networks, and Ensemble Methods, to provide a
guideline for choosing a suitable algorithm for a given problem. The authors believe that there exists a vast amount of
knowledge in the machine learning community that is directly applicable to current aviation research problems. The
taxonomy differences, at times, restrict practitioners from accessing that information. With this paper, we take the first
steps towards bridging that gap by providing an intuition for the most popular machine learning techniques to readers
from the aviation community.
For future work, the authors aim to develop a detailed mapping of various categories of machine learning techniques to
prominent problems in the aviation community. Moreover, the authors will continue to develop similar comparisons for
other categories of machine learning techniques.

Acknowledgments
The authors would like to thank Aparna Agrawal, SM Ferdous and Luis Zertuche for their assistance with the data
collection and processing.

References
[1] Smith, A., Collins, K., and Mavris, D., “Survey of Technology Forecasting Techniques for Complex Systems,” 58th
AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, 2017, p. 0974.

[2] Lee, H., Malik, W., and Jung, Y. C., “Taxi-out time prediction for departures at Charlotte airport using machine learning
techniques,” 16th AIAA Aviation Technology, Integration, and Operations Conference, 2016, p. 3910.

[3] Ukai, T., Chao, H., and DeLaurentis, D. A., “An Aircraft Deployment Prediction Model Using Machine Learning Techniques,”
17th AIAA Aviation Technology, Integration, and Operations Conference, 2017, p. 3081.

[4] Kotegawa, T., DeLaurentis, D. A., and Sengstacken, A., “Development of network restructuring models for improved air traffic
forecasts,” Transportation Research Part C: Emerging Technologies, Vol. 18, No. 6, 2010, pp. 937–949.

[5] Matthews, B., Das, S., Bhaduri, K., Das, K., Martin, R., and Oza, N., “Discovering anomalous aviation safety events using
scalable data mining algorithms,” Journal of Aerospace Information Systems, 2013.

[6] Janakiraman, V. M., and Nielsen, D., “Anomaly detection in aviation data using extreme learning machines,” Neural Networks
(IJCNN), 2016 International Joint Conference on, IEEE, 2016, pp. 1993–2000.

[7] Christopher, A. A., Vivekanandam, V. S., Anderson, A. A., Markkandeyan, S., and Sivakumar, V., “Large-scale data analysis on
aviation accident database using different data mining techniques,” The Aeronautical Journal, Vol. 120, No. 1234, 2016, pp.
1849–1866.

[8] Burnett, R. A., and Si, D., “Prediction of Injuries and Fatalities in Aviation Accidents through Machine Learning,” Proceedings
of the International Conference on Compute and Data Analysis, ACM, 2017, pp. 60–68.

[9] Taneja, N. K., “A model for forecasting future air travel demand on the North Atlantic,” Tech. rep., Massachusetts
Institute of Technology, Flight Transportation Laboratory, Cambridge, Mass., 1971.

[10] Bafail, A. O., Abed, S. Y., Jasimuddin, S., and Jeddah, S., “The determinants of domestic air travel demand in the Kingdom of
Saudi Arabia,” Journal of Air Transportation World Wide, Vol. 5, No. 2, 2000, pp. 72–86.

[11] Murphy, K. P., Machine learning: a probabilistic perspective, MIT press, 2012.

[12] Zhang, S., Zhang, C., and Yang, Q., “Data preparation for data mining,” Applied Artificial Intelligence, Vol. 17, No. 5-6, 2003,
pp. 375–381.

[13] Batista, G. E., and Monard, M. C., “An analysis of four missing data treatment methods for supervised learning,” Applied
artificial intelligence, Vol. 17, No. 5-6, 2003, pp. 519–533.

[14] Niu, Z., Shi, S., Sun, J., and He, X., “A survey of outlier detection methodologies and their applications,” Artificial intelligence
and computational intelligence, 2011, pp. 380–387.

[15] Yu, L., and Liu, H., “Efficient feature selection via analysis of relevance and redundancy,” Journal of machine learning research,
Vol. 5, No. Oct, 2004, pp. 1205–1224.

[16] Markovitch, S., and Rosenstein, D., “Feature generation using general constructor functions,” Machine Learning, Vol. 49, No. 1,
2002, pp. 59–98.

[17] Kotsiantis, S. B., Zaharakis, I., and Pintelas, P., “Supervised machine learning: A review of classification techniques,” 2007.

[18] Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A., Classification and regression trees, CRC press, 1984.

[19] Kullback, S., and Leibler, R. A., “On information and sufficiency,” The annals of mathematical statistics, Vol. 22, No. 1, 1951,
pp. 79–86.

[20] Vapnik, V., The nature of statistical learning theory, Springer science & business media, 2013.

[21] Veropoulos, K., Campbell, C., Cristianini, N., et al., “Controlling the sensitivity of support vector machines,” Proceedings of
the international joint conference on AI, 1999, pp. 55–60.

[22] Schölkopf, B., Burges, C. J., and Smola, A. J., Advances in kernel methods: support vector learning, MIT press, 1999.

[23] Crammer, K., and Singer, Y., “On the algorithmic implementation of multiclass kernel-based vector machines,” Journal of
machine learning research, Vol. 2, No. Dec, 2001, pp. 265–292.

[24] Drucker, H., Burges, C. J., Kaufman, L., Smola, A. J., and Vapnik, V., “Support vector regression machines,” Advances in
neural information processing systems, 1997, pp. 155–161.

[25] Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.

[26] Kuhn, M., and Johnson, K., Applied Predictive Modeling, SpringerLink : Bücher, Springer New York, 2013. URL
https://books.google.com/books?id=xYRDAAAAQBAJ.

[27] “RITA-BTS-Transtats,” 2017. URL http://www.transtats.bts.gov/databases.asp?mode_id=1&mode_desc=aviation&subject_id2=0.

[28] “Population data,” 2017. URL http://factfinder.census.gov/.

[29] “Per Capita Income Data,” 2017. URL http://www.deptofnumbers.com/income/metros/.

[30] “US GDP and Personal Income Data,” 2017. URL https://www.bea.gov/iTable/index_nipa.cfm.

[31] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R.,
Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E., “Scikit-learn: Machine
Learning in Python,” Journal of Machine Learning Research, Vol. 12, 2011, pp. 2825–2830.
