All content following this page was uploaded by Apoorv Maheshwari on 05 July 2018.
Machine learning is becoming a popular way to find patterns in complex data. With
advances in storage and computational capabilities, many machine learning techniques
are becoming suitable for real-world applications. In the authors' canvassing of the literature,
the adoption of machine learning techniques in the aviation community is low compared to
other communities, predominantly due to limited access to high-quality data and a high
reliance on simple, easily interpretable models over complex predictive models. In
addition, the taxonomy differences between the computer science and aviation communities
Downloaded by PURDUE UNIVERSITY on July 5, 2018 | http://arc.aiaa.org | DOI: 10.2514/6.2018-3980
also make adoption difficult. In this paper, we perform a comparative study of popular
supervised machine learning techniques for aviation problems, using an air travel demand
modeling problem as an example. We implement Classification and Regression Trees, Support
Vector Machines, Neural Networks, and Ensemble Methods on the air travel demand estimation
and forecasting problems. With this work, we aim to provide a qualitative
comparison of these techniques to serve as a guideline for choosing a suitable algorithm for a
given problem.
I. Introduction
Machine learning is a field of computer science that focuses on algorithms that can learn from and perform predictive
analysis on data. The idea of using data to create a model is not new and has always been the basis of scientific
research, especially experimental research. Storage and computational capacity have grown rapidly in
recent years, making many techniques that were proposed in the early 19th century (or even earlier) suitable for
real-world applications. With recent advancements in data processing frameworks, cloud services, and tools/libraries
that implement a variety of algorithms, the effort needed to apply a new technique to a given dataset has reduced
considerably. In fact, most of the commonly used machine learning tools are developed with a special focus on ease of
use across different communities. We already experience machine learning in our daily routine in the
form of smart assistants in our mobile phones, spam filtering in our emails, computer vision technologies in our cars,
preventive health care through smart medication/appointment reminders, text-to-speech natural language processing, and
so on.
As machine learning gains prominence in all these areas, there is a need to start using this knowledge
systematically for aviation problems. One of the biggest benefits of machine learning techniques over conventional
statistical approaches is that the model form is mostly driven by the dataset rather than by initial assumptions placed upon
the form of the model. For example, in conventional statistical approaches, the data is usually fit to an assumed form of a
mathematical model, which naturally restricts the effectiveness of the model because the dynamics that
generate the data are unknown. In machine learning, fewer restrictions are placed on the model form, which allows for more flexibility in the
modeling. The flip side is that machine learning techniques typically require a large dataset and substantial computational
capability to create the model. Moreover, the better performance of these techniques comes at the price of increased
complexity, which makes the models more difficult to interpret than those from conventional statistical approaches.
Researchers are gradually beginning to apply machine learning techniques to aviation problems, but adoption
is slow due to the unavailability of high-quality data, which is typically held privately by large industrial organizations,
and a high reliance on expert opinion rather than data-driven complex models. Smith et al. [1] discuss different
technology forecasting techniques for complex systems and identify machine learning as useful for providing estimates
for future technology predictions. Lee et al. [2] used data provided by American Airlines to compare various machine
learning techniques for predicting taxi-out time at Charlotte Airport. They found that the machine learning algorithms
∗ Graduate Research Assistant, Aeronautics and Astronautics, 701 W. Stadium Ave., AIAA Student Member
† Research Scientist, Aeronautics and Astronautics, 701 W. Stadium Ave., AIAA Member
‡ Professor, Aeronautics and Astronautics, 701 W. Stadium Ave., AIAA Associate Fellow
employed to model and forecast total transatlantic air travel demand. A more recent attempt using regression tools was
made by Bafail et al. [10]; this reference provides insights into the process of selecting explanatory variables and the
potential problem that arises when there is multicollinearity between predictors. These studies serve as good
starting points for selecting appropriate features for our machine learning study. In the case study, we model the air
travel demand between any two cities, based on socio-technical factors, using machine learning techniques. To
reduce the complexity of the example problem, we restrict our analysis to the top 30 airports (enplanement-wise) of the
US domestic air transportation network. We also restrict our feature space to publicly available data
so that policy makers, who might not have access to the proprietary information of various service providers (such as
airlines), can also make use of these models for future planning.
The overarching research objective of our work is to compare various machine learning techniques for aviation
problems with the help of an air travel demand modeling problem as an example.
In section II, we give a quick introduction to the different categories of machine learning approaches that are discussed in
this paper. In section III, we describe the problem setup and the data used for the analysis. In section IV, we discuss the
implementation of the different machine learning algorithms along with a comparison of the algorithms, and section V provides a
summary of our work and future vision.
be formulated with an objective to estimate the level of demand (such as very low-low-medium-high-very high)
instead of getting an exact numerical value.
A typical decision tree algorithm works top-down and chooses, at each step, the variable that best
splits the relevant subset of the data. Different algorithms use different metrics to quantify the quality of a split. Some of the most
commonly used metrics are Gini impurity [18] and information gain [19].
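As a minimal sketch of how such split metrics can be computed, the following Python snippet evaluates the Gini impurity and the entropy-based information gain of a candidate binary split. The function names and the toy demand labels are illustrative assumptions; they are not taken from the paper or from any library.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy example (hypothetical labels): a split that separates the classes perfectly.
parent = ["high", "high", "low", "low"]
left, right = ["high", "high"], ["low", "low"]
print(gini_impurity(parent))                  # impurity before the split
print(information_gain(parent, left, right))  # gain achieved by the split
```

A tree-growing algorithm would evaluate a metric of this kind for every candidate variable and threshold and keep the split with the lowest impurity (or highest gain).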
Decision trees are preferred in the machine learning community as they often provide a fast and easy-to-understand
representation of the data. Decision tree models are very useful when modeling human decisions and behavior.
Fig. 2 Decision tree for survival on Titanic (Values under the leaves show the probability of survival and the
percentage of observations in the leaf)
$$w^T x_i + b \geq 1 \quad \forall\, y_i = 1$$
$$w^T x_i + b \leq -1 \quad \forall\, y_i = -1$$
In this case, the decision rule is $f_{w,b}(x) = \mathrm{sgn}(w^T x + b)$, where $w$ is the weight vector and $b$ is the bias. In other words,
for linearly separable data, we can find two parallel hyperplanes that separate the two classes of data such that the
distance between them is maximized. We define the region between these two hyperplanes as the margin, and thus the
maximum-margin hyperplane lies at an equal distance from both of the previously identified hyperplanes.
Thus, identifying the maximum-margin hyperplane (when the data is linearly separable) can be formulated as an
optimization problem:
$$\text{Minimize } \|w\| \quad \text{subject to } y_i (w^T x_i + b) \geq 1, \text{ for } i = 1, \ldots, n$$
For data that is not linearly separable, Veropoulos et al. [21] proposed the concept of a soft margin by introducing non-negative
slack variables $\xi_i = \max(0, 1 - y_i (w^T x_i + b))$, $i = 1, \ldots, n$, in the constraints. The modified optimization problem is:
$$\text{Minimize } \lambda \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \xi_i$$
$$\text{subject to } y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \text{for } i = 1, \ldots, n$$
where λ determines the trade-off between classifying the training points correctly and the size of the margin. In real-world applications, the
dataset is rarely linearly separable, and a popular approach is to map the data to a higher-dimensional space
(computationally enabled by kernel functions [22]) and define the separating hyperplane there.
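The decision rule, the slack variables, and the soft-margin objective described above can be evaluated numerically, as in the following minimal sketch. The helper names, the toy data points, and the chosen values of w, b, and λ are assumptions made for illustration only; the sketch evaluates the objective for a fixed hyperplane and does not solve the optimization problem.

```python
def score(w, b, x):
    """Signed score w^T x + b of a point x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def decision(w, b, x):
    """Decision rule sgn(w^T x + b); a score of exactly 0 is mapped to +1 here."""
    return 1 if score(w, b, x) >= 0 else -1


def slack(w, b, x, y):
    """Slack variable xi = max(0, 1 - y (w^T x + b))."""
    return max(0.0, 1.0 - y * score(w, b, x))


def soft_margin_objective(w, b, X, Y, lam):
    """lam * ||w||^2 + (1/n) * sum_i xi_i for a fixed hyperplane (w, b)."""
    norm_sq = sum(wj * wj for wj in w)
    mean_slack = sum(slack(w, b, x, y) for x, y in zip(X, Y)) / len(X)
    return lam * norm_sq + mean_slack


# Hypothetical 2-D data: the last point is correctly classified but lies inside the margin.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
Y = [1, -1, 1]
w, b = (1.0, 0.0), 0.0

print([slack(w, b, x, y) for x, y in zip(X, Y)])  # only the third point needs slack
print(soft_margin_objective(w, b, X, Y, lam=0.1))
```

Points outside the margin contribute zero slack, so only points on or inside the margin (or on the wrong side) influence the second term of the objective.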
Crammer and Singer [23] proposed a variation of this binary classification problem to extend it to multiclass problems.
Similarly, Drucker et al. [24] provided a formulation to extend the classification problem to regression problems. The
regression approach is commonly referred to as Support Vector Regression (SVR).
Once the separating hyperplane is identified, most of the data other than the points closest to the margin (known as
the support vectors) becomes redundant. Thus, SVM generalizes well from the given dataset and is usually
robust against small changes in the input dataset. Due to these advantages, SVM is very popular in text categorization
and hand-written character recognition.
3. Neural Networks
Neural networks (or deep learning, as it is more recently called) are a family of techniques inspired by how neurons work in
biological systems (i.e., the brain). The idea is to form a mathematical representation that is similar to how interconnected
neurons in biological systems are wired to fire and produce an output, given some form of a stimulus signal. These
mathematical functions are expressed as nodes and are interconnected to form a complex web of inputs and outputs
among and between said nodes; the sequence, the types of functions used, and the connectivity between the neural network's
nodes dictate its effectiveness in dealing with various forms of machine learning problems. Figure 3 shows a
notional neural network setup in which the input nodes denote the input variables, the hidden nodes represent intermediate
values calculated from the weighted inputs using the activation functions of the chosen type of neural network,
and the output nodes hold the predicted outcome values. The neural network learns a set of weighting factors
to improve the mapping of input data to some desired (observed) output data.
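To make the roles of the input, hidden, and output nodes concrete, the following is a minimal sketch of a forward pass through a network with a single hidden layer. The layer sizes, the tanh activation, and all weight values are hypothetical choices for illustration; the training step that learns the weights is omitted.

```python
from math import tanh


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b), computed node by node."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def identity(z):
    """Linear (pass-through) activation for the output node."""
    return z


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input nodes -> hidden nodes (tanh) -> output nodes (linear)."""
    hidden = dense(x, hidden_w, hidden_b, tanh)
    return dense(hidden, out_w, out_b, identity)


# Hypothetical weights for 2 inputs, 3 hidden nodes, and 1 output.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [[1.0, -1.0, 0.5]]
out_b = [0.2]

prediction = forward([1.0, 2.0], hidden_w, hidden_b, out_w, out_b)
print(prediction)
```

Training adjusts the entries of the weight and bias lists so that the predictions move closer to the observed outputs; libraries typically do this with gradient-based optimization.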
While neural networks have been around since the 1940s, their use and popularity diminished until recently. The
resurgence in the use, research, and development of neural networks, and of machine learning in general, has been fueled by
a range of factors, including advances in nonlinear optimization techniques, computational hardware and software
improvements, and voluminous data, to name a few. These, among other factors, have led to increased application of
neural networks in complex applications, resulting in increased accuracy [25].
4. Ensemble Methods
Ensemble methods are meta-algorithms that extend machine learning techniques by weighting
the different possible models used in predictions. In a sense, the predicted outcome is generated from a
weighted (or otherwise aggregated) set of outcomes from multiple models, each of which uses the input data either wholly or
in part. Examples of ensemble-based methods include bootstrapping [18], bagging (also known
as bootstrap aggregating), and stacking [26]. Figure 4 highlights an example architecture for an ensemble method in which
a set of data is used by multiple models to generate multiple individual predictions. In this case, we take the example
of decision trees (e.g., Classification and Regression Trees as explained earlier) where different tree structures may be
formed, and different samples or subsets of the data used. In this case, the use of a bagging ensemble approach such as
Random Forests to generate different tree structures, and consequently different predictions, generally results in good
prediction performance when the individual model predictions are aggregated. The intuition behind ensemble
methods is that typical model errors come from a trade-off between model bias and variance. In
ensemble methods, the idea is to aggregate away these sources of error across the models used, since within any
individual model they can only be traded off against each other.
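The bagging idea can be illustrated with a minimal sketch in which each base model is a depth-1 regression tree (a "stump") trained on its own bootstrap sample, and the ensemble prediction is the average of the individual predictions. This is a simplified stand-in written for this illustration and is not the Random Forest algorithm itself, which additionally subsamples features; the function names and the synthetic step-shaped data are assumptions.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree by picking the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds: every observed value above the minimum
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all x values identical: fall back to a constant prediction
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr


def bagged_stumps(xs, ys, n_models, rng):
    """Train each stump on a bootstrap sample (drawn with replacement) and average them."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


# Synthetic data (an assumption for illustration): a noisy step from 0 to 1 at x = 1.
rng = random.Random(0)
xs = [i / 10 for i in range(20)]
ys = [(0.0 if x < 1.0 else 1.0) + rng.gauss(0, 0.1) for x in xs]

ensemble = bagged_stumps(xs, ys, n_models=25, rng=rng)
print(ensemble(0.2), ensemble(1.8))
```

Each stump sees a slightly different sample and therefore places its threshold and leaf values slightly differently; averaging smooths out these differences.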
IV. Results and Comparison of the Algorithms
In this section, we describe the results that were obtained using different algorithms on the same dataset. It is
important to keep in mind that the aim of this paper is not to create the best demand model but to describe the process of
tuning the parameters to get a good model for these algorithms and more importantly, compare algorithms based upon
their ease of use and other properties. All the models are implemented using standard libraries from one of the most
popular Machine Learning tools, scikit-learn [31].
A. CART Results
We implemented CART using the standard DecisionTreeRegressor and DecisionTreeClassifier classes. We
used 10-fold cross validation to assess the performance of the models. For the classification problem, the mean accuracy on
the test data was computed, whereas for the regression problem, the mean $R^2$ value on the test data was used to assess the performance.
Various hyperparameters need to be tuned for a decision tree, such as the maximum depth of the
tree, the maximum number of features to be used, the minimum number of samples required to split a node, the minimum
number of sample points in a leaf, and the splitting metric. Due to the large number of possible combinations, we use
RandomizedSearchCV, where a randomized search is performed over a set of parameters by sampling each setting from a
distribution over possible parameter values. This method is preferred over an exhaustive search when the design space is
computationally too expensive to explore. In Figure 5, we show the model performance with respect to the maximum
depth of a tree. We can observe that beyond a specific maximum depth, the training accuracy becomes 1.0, implying that
the model completely fits the training data, while the testing accuracy settles at a lower value. In both cases, we observe the
maximum performance at a maximum depth of 16, and thus we set the hyperparameter value to 16, as increasing it
further does not make any difference to the model performance.
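A randomized hyperparameter search of this kind can be sketched with scikit-learn as follows. The synthetic dataset, the sampling ranges, and the number of sampled settings are assumptions chosen for illustration and do not reproduce the paper's data or settings; the splitting metric is left at its default because its accepted names differ between scikit-learn versions.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the paper's demand dataset is not reproduced here.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Distributions to sample each hyperparameter from (ranges are illustrative assumptions).
param_distributions = {
    "max_depth": randint(2, 21),          # integers 2..20
    "max_features": randint(1, 9),        # integers 1..8
    "min_samples_split": randint(2, 21),
    "min_samples_leaf": randint(1, 11),
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,       # number of sampled hyperparameter settings
    cv=10,           # 10-fold cross validation
    scoring="r2",    # mean R^2 on the held-out folds
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

For the classification problem, the same pattern would apply with DecisionTreeClassifier and `scoring="accuracy"`.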
It is important to note that the major benefit of a decision tree lies in easy representation of the data and therefore, it
is desirable to keep the tree depth low to keep the model human-readable.
(a) Tuning maximum depth for Regression (b) Tuning maximum depth for Classification
B. SVM Results
We implemented the SVM with a Radial Basis Function kernel for both the classification and regression problems. Again,
10-fold cross validation was used to assess the performance of the models, with mean $R^2$ and mean accuracy as the
scoring parameters for the regression and classification models, respectively. We tuned the cost (or penalty) hyperparameter
using GridSearchCV to obtain the optimal model. Figure 6 shows the variation in model performance with a varying
cost parameter. It is important to understand that the penalty parameter trades off misclassification/bias against the
simplicity/variance of the model, i.e., increasing the penalty value yields a more complex model with less bias,
whereas decreasing it decreases the variance of the model.
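The tuning of the cost parameter can be sketched with scikit-learn as follows. The synthetic dataset, the grid of C values, and the feature-scaling step are assumptions made for this illustration; the paper does not state its grid or whether the features were scaled.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; the paper's demand dataset is not reproduced here.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Scaling the features before an RBF-kernel SVM is a common practice (an assumption here).
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Candidate cost (penalty) values; "svr__C" addresses the C parameter of the SVR step.
param_grid = {"svr__C": [0.1, 1.0, 10.0, 100.0, 1000.0]}

search = GridSearchCV(model, param_grid, cv=10, scoring="r2")
search.fit(X, y)

# Mean and standard deviation of the cross-validated R^2 for each cost value.
for C, mean_r2, std_r2 in zip(
    param_grid["svr__C"],
    search.cv_results_["mean_test_score"],
    search.cv_results_["std_test_score"],
):
    print(f"C={C:g}: mean R^2 = {mean_r2:.3f} (std {std_r2:.3f})")
print(search.best_params_)
```

Unlike the randomized search, the grid search evaluates every listed value, which is practical here because only one hyperparameter is varied.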
In our case, the best regression model had a mean $R^2$ value of 0.408 with a standard deviation of 0.086, and the
best classification model had a mean accuracy of 0.616 with a standard deviation of 0.031. In general, SVM produces
models with good prediction accuracy, but it requires a large sample size to do so. In our sample problem, the dataset is
not large enough to achieve that, which is why we observe the decision tree outperforming SVM by a large margin.
(a) Tuning penalty parameter for SVM Regression (b) Tuning penalty parameter for SVM Classification
(a) Tuning Penalty parameter for Neural Network Regression (b) Tuning Penalty parameter for Neural Net Classification
While neural networks have found great success in many complex problems ranging from pattern recognition to
stock predictions, in our sample case the MLP does not perform very well, despite the large number of neural network
nodes, due to the very limited data used. This highlights a key point: while such methods may be generally effective
in dealing with complex interactions, their data requirements can limit their full potential. In our case, the
addition of more nodes generally improves the predictions, but at a large computational cost and with the risk of over-fitting
relatively small data sets to a large number of neural network features.
(a) Tuning #Estimators for Random Forest Regression (b) Tuning #Estimators for Random Forest Classification
Decision trees and neural networks lie at opposite ends of the spectrum with regard to the transparency of the fitted
model. In a decision tree, one can observe each node where a decision is made, whereas in a neural network
the fitted model is a black box in which it is very difficult to interpret the impact of individual predictor variables. SVMs
paired with complex kernel functions present a similar issue to neural networks.
No single algorithm uniformly outperforms the other algorithms over all datasets. Therefore, the
qualitative comparison summarized in Table 1 is meant to be a guideline for practitioners to start with an
appropriate algorithm for the problem at hand. The Expected Performance is compiled from existing empirical and
theoretical studies, whereas the Observed Performance is specific to the problem discussed in this paper.
Table 1 Comparison summary (*** stars denote the best & * star the worst performance) adapted from [17]
V. Conclusion
In this paper, we provided a brief introduction to a diverse set of supervised machine learning techniques and
demonstrated their application to aviation problems by applying them to a representative air travel demand modeling
problem. We performed a qualitative comparison of different machine learning techniques, namely Classification and
Regression Trees (CART), Support Vector Machines (SVM), Neural Networks, and Ensemble Methods, to provide a
guideline for choosing a suitable algorithm for a given problem. The authors believe that there exists a vast amount of
knowledge in the machine learning community that is directly applicable to current aviation research problems. The
taxonomy differences, at times, restrict practitioners from accessing that information. With this paper, we take first
steps towards bridging that gap by providing readers from the aviation community with an intuition for the most
popular machine learning techniques.
For future work, the authors aim to develop a detailed mapping of various categories of machine learning techniques to
prominent problems in the aviation community. Moreover, the authors will continue to develop similar comparisons for
other categories of machine learning techniques.
Acknowledgments
The authors would like to thank Aparna Agrawal, SM Ferdous and Luis Zertuche for their assistance with the data
collection and processing.
References
[1] Smith, A., Collins, K., and Mavris, D., “Survey of Technology Forecasting Techniques for Complex Systems,” 58th
AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, 2017, p. 0974.
[2] Lee, H., Malik, W., and Jung, Y. C., “Taxi-out time prediction for departures at Charlotte airport using machine learning
techniques,” 16th AIAA Aviation Technology, Integration, and Operations Conference, 2016, p. 3910.
[3] Ukai, T., Chao, H., and DeLaurentis, D. A., “An Aircraft Deployment Prediction Model Using Machine Learning Techniques,”
17th AIAA Aviation Technology, Integration, and Operations Conference, 2017, p. 3081.
[4] Kotegawa, T., DeLaurentis, D. A., and Sengstacken, A., “Development of network restructuring models for improved air traffic
forecasts,” Transportation Research Part C: Emerging Technologies, Vol. 18, No. 6, 2010, pp. 937–949.
[5] Matthews, B., Das, S., Bhaduri, K., Das, K., Martin, R., and Oza, N., “Discovering anomalous aviation safety events using
scalable data mining algorithms,” Journal of Aerospace Information Systems, 2013.
[6] Janakiraman, V. M., and Nielsen, D., “Anomaly detection in aviation data using extreme learning machines,” Neural Networks
(IJCNN), 2016 International Joint Conference on, IEEE, 2016, pp. 1993–2000.
[7] Christopher, A. A., Vivekanandam, V. S., Anderson, A. A., Markkandeyan, S., and Sivakumar, V., “Large-scale data analysis on
aviation accident database using different data mining techniques,” The Aeronautical Journal, Vol. 120, No. 1234, 2016, pp.
1849–1866.
[8] Burnett, R. A., and Si, D., “Prediction of Injuries and Fatalities in Aviation Accidents through Machine Learning,” Proceedings
of the International Conference on Compute and Data Analysis, ACM, 2017, pp. 60–68.
[9] Taneja, N. K., “A model for forecasting future air travel demand on the North Atlantic,” Tech. rep., Massachusetts
Institute of Technology, Flight Transportation Laboratory, Cambridge, Mass., 1971.
[10] Bafail, A. O., Abed, S. Y., Jasimuddin, S., and Jeddah, S., “The determinants of domestic air travel demand in the Kingdom of
Saudi Arabia,” Journal of Air Transportation World Wide, Vol. 5, No. 2, 2000, pp. 72–86.
[11] Murphy, K. P., Machine learning: a probabilistic perspective, MIT press, 2012.
[12] Zhang, S., Zhang, C., and Yang, Q., “Data preparation for data mining,” Applied Artificial Intelligence, Vol. 17, No. 5-6, 2003,
pp. 375–381.
[13] Batista, G. E., and Monard, M. C., “An analysis of four missing data treatment methods for supervised learning,” Applied
artificial intelligence, Vol. 17, No. 5-6, 2003, pp. 519–533.
[14] Niu, Z., Shi, S., Sun, J., and He, X., “A survey of outlier detection methodologies and their applications,” Artificial intelligence
and computational intelligence, 2011, pp. 380–387.
[15] Yu, L., and Liu, H., “Efficient feature selection via analysis of relevance and redundancy,” Journal of machine learning research,
Vol. 5, No. Oct, 2004, pp. 1205–1224.
[16] Markovitch, S., and Rosenstein, D., “Feature generation using general constructor functions,” Machine Learning, Vol. 49, No. 1,
2002, pp. 59–98.
[17] Kotsiantis, S. B., Zaharakis, I., and Pintelas, P., “Supervised machine learning: A review of classification techniques,” 2007.
[18] Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A., Classification and regression trees, CRC press, 1984.
[19] Kullback, S., and Leibler, R. A., “On information and sufficiency,” The annals of mathematical statistics, Vol. 22, No. 1, 1951,
pp. 79–86.
[20] Vapnik, V., The nature of statistical learning theory, Springer science & business media, 2013.
[21] Veropoulos, K., Campbell, C., Cristianini, N., et al., “Controlling the sensitivity of support vector machines,” Proceedings of
the international joint conference on AI, 1999, pp. 55–60.
[22] Schölkopf, B., Burges, C. J., and Smola, A. J., Advances in kernel methods: support vector learning, MIT press, 1999.
[23] Crammer, K., and Singer, Y., “On the algorithmic implementation of multiclass kernel-based vector machines,” Journal of
machine learning research, Vol. 2, No. Dec, 2001, pp. 265–292.
[24] Drucker, H., Burges, C. J., Kaufman, L., Smola, A. J., and Vapnik, V., “Support vector regression machines,” Advances in
neural information processing systems, 1997, pp. 155–161.
[25] Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.
[26] Kuhn, M., and Johnson, K., Applied Predictive Modeling, Springer New York, 2013. URL
https://books.google.com/books?id=xYRDAAAAQBAJ.
[30] “US GDP and Personal Income Data,” 2017. URL https://www.bea.gov/iTable/index_nipa.cfm.
[31] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R.,
Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E., “Scikit-learn: Machine
Learning in Python,” Journal of Machine Learning Research, Vol. 12, 2011, pp. 2825–2830.