
Flight Delay Prediction System using Data Mining Approach

Durga Ambekar
Information Technology
MCT Rajiv Gandhi Institute of Technology
Mumbai, India
durgaambekar@gmail.com

Aaina Jain
Information Technology
MCT Rajiv Gandhi Institute of Technology
Mumbai, India
jain.aaina16@gmail.com

Shreyas Jadhav
Information Technology
MCT Rajiv Gandhi Institute of Technology
Mumbai, India
jshreyas.sj@gmail.com

Abhay E. Patil
Information Technology
MCT Rajiv Gandhi Institute of Technology
Mumbai, India
abhay.patil@mctrgit.ac.in

Abstract--Growth in the aviation industry has resulted in air-traffic congestion, causing flight delays. Flight delays have not only an economic impact but also harmful environmental effects, and air-traffic management is becoming increasingly challenging. Airline delays cause immense losses for businesses as well as budgetary losses for a country. There are many reasons for flight delays, among them security issues, mechanical problems, weather conditions and airport congestion. We propose machine learning algorithms, namely SVM regression, decision tree regression and a hybrid ensemble regression technique, to predict flight delay. Aviation is among the highest revenue-producing sectors for many countries and, among the many modes of transportation, it is the fastest and most comfortable. Identifying and reducing flight delays with machine learning algorithms can therefore save a huge amount of revenue.

Keywords: Flight Delay Prediction, SVM, Decision Tree Regression, R2, MSE.

I. INTRODUCTION

Delay is one of the most remembered performance indicators of any transportation system. Notably, commercial aviation players understand delay as the period by which a flight is late or postponed. Conventionally, if a flight's departure or arrival time is more than 15 minutes later than its scheduled departure or arrival time, respectively, then a departure or arrival delay is considered to have occurred at the corresponding airport. Notable reasons for delays of commercially scheduled flights are adverse weather conditions, air traffic congestion, late arrival of the aircraft from its previous flight, maintenance and security issues. An intelligent and automated prediction system that can predict possible airline delays is therefore a must. This project aims at analyzing flight information of domestic flights operated by airlines, covering the top 5 busiest airports, and predicting the possible arrival delay and departure delay of a flight using data mining and machine learning approaches.

As the population grows rapidly and time becomes ever more valuable to travellers, the importance of air travel has risen. In the 1960s, high costs and recurring delays limited interest in flying, but with government support many companies began manufacturing aircraft at lower cost and with more comfort, and many airports were built, which made the control of airline traffic a concern. The airline economy plays a predominant role in a country's economy, so delays cause huge losses. Machine learning is one of the recent technologies that can be used to predict flight delays, and the application of data mining techniques to airline problems is growing rapidly.
II. RELATED WORK

A good amount of research attention has been dedicated to the study of flight delays. Predicting and analyzing delays and their causes have long been active subjects of research because of their essential significance in air traffic control, airline decision making and ground delay programs. Different researchers have studied this issue from various perspectives.

• Review on Machine Learning Techniques:

[1] Chakrabarty, Navoneel, et al. "Flight Arrival Delay Prediction Using Gradient Boosting Classifier." Emerging Technologies in Data Mining and Information Security. Springer, Singapore, 2019.
Chakrabarty et al. proposed a machine learning model using a Gradient Boosting Classifier for predicting flight arrival delay. The paper proposed a hyper-parameter tuned approach, applying Grid Search to the Gradient Boosting Classifier model on flight data.

[2] Suvojit Manna, Sanket Biswas, Riyanka Kundu, Somnath Rakshit, Priti Gupta, Subhas Burman "A statistical approach to predict flight delay using gradient boosted decision tree", International Conference on Computational Intelligence in Data Science (ICCIDS), 2017.
Manna et al. explored and analyzed flight data and developed a regression model using a Gradient Boosting Regressor for predicting both flight departure and arrival delays.

[3] Juan Jose Rebollo and Hamsa Balakrishnan "Characterization and Prediction of Air Traffic Delays".
Rebollo et al. applied Random Forest on an air traffic network framework for predicting future flight departure delays. The main objective of the paper is to predict the departure delay on a particular link or at a particular airport at some time in the future.

[4] Sruti Oza, Somya Sharma, Hetal Sangoi, Rutuja Raut, V.C. Kotak "Flight Delay Prediction System Using Weighted Multiple Linear Regression", International Journal of Engineering and Computer Science, Volume 4 Issue 4 April 2015.
Oza et al. attempted weather-induced flight delay prediction by implementing Weighted Multiple Linear Regression on flight data combined with weather factors. The research finds that arrival and departure delays are highly correlated, with a correlation of around 0.9.

[5] Anish M. Kalliguddi and Aera K. Leboulluec "Predictive Modeling of Aircraft Flight Delay", Universal Journal of Management 5(10): 485-491, 2017.
Kalliguddi et al. constructed regression models such as Decision Tree Regressor, Random Forest Regressor and Multiple Linear Regressor on flight data for predicting both departure and arrival delays.

[6] Brett Naul "Airline Departure Delay Prediction".
Naul applied Logistic Regression, a Naive Bayes Classifier and a Support Vector Machine on flight data for prediction of flight departure delay. The aim of that project is to use large historical datasets to make predictions about the punctuality of future flights far in advance.

• Review on Deep Learning Techniques:

[7] Young Jin Kim, Sun Choi, Simon Briceno, Dimitri Mavris "A deep learning approach to flight delay prediction", 35th Digital Avionics Systems Conference (DASC), 2016.
Kim et al. implemented a deep learning approach using Recurrent Neural Networks (RNNs) for predicting flight delay. They used a long short-term memory RNN architecture.

[8] Sina Khanmohammadi, Salih Tutun, Yunus Kucuk "A New Multilevel Input Layer Artificial Neural Network for Predicting Flight Delays at JFK Airport".
Khanmohammadi et al. proposed a deep learning approach using an Artificial Neural Network (ANN) and also introduced a new type of multilevel input layer ANN.

• Review on Big Data Approach:

[9] Loris Belcastro, Fabrizio Marozzo, Domenico Talia and Paolo Trunfio "Using Scalable Data Mining for Predicting Flight Delays".
Belcastro et al. proposed a big data approach. The main goal of their work is to implement a predictor of the arrival delay of a scheduled flight due to weather conditions by analyzing and mining flight information together with the corresponding weather conditions, using parallel algorithms implemented as MapReduce programs executed on a cloud platform.
III. PROBLEM STATEMENT

The main objective here is to use the available flight operational data and data mining techniques to construct an analytical model. The analytical model is used to predict flight delay based on a set of flight attributes, which are discussed in a later section of this paper. Additional models will be created to determine the most likely cause of a flight delay and to predict the approximate duration of the delay.

IV. RESEARCH ATTRIBUTES

In this paper, we extract some attributes that affect flight delay and formulate them as an input vector x in the proposed model, as shown in Table 1.

V. PROPOSED SYSTEM

An outline of the model developed to predict delays of individual flights is shown in Fig. 1. The model consists of two main parts, the training process and the prediction process. The training process starts with data collection. Historical flight data and data corresponding to airlines and airports are collected and joined together using the scheduled departure time and airport as the join keys. In the preprocessing step, missing data are estimated and normalization is performed. The training set is then ready and is used to train the predictive model with sampling techniques. Data for the prediction process are collected and preprocessed in the same way as the training set and then fed into the model trained on the training data. In the end, the model assigns each data point a label.
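The join step of the training process can be illustrated with pandas. This is only a minimal sketch under stated assumptions: the column names (`origin_airport`, `scheduled_departure`, `iata_code`) and the toy rows are hypothetical and are not taken from the dataset used in this paper. Only the airport key is shown; a time-dependent table would add the scheduled departure time as a second key.

```python
import pandas as pd

# Hypothetical flight records; column names are illustrative assumptions.
flights = pd.DataFrame({
    "airline": ["AA", "DL", "UA"],
    "origin_airport": ["ATL", "ORD", "ATL"],
    "scheduled_departure": [900, 1330, 1745],   # local time, hhmm
    "departure_delay": [5.0, 42.0, None],       # minutes; None marks a missing value
})

# Hypothetical airport reference table.
airports = pd.DataFrame({
    "iata_code": ["ATL", "ORD"],
    "latitude": [33.64, 41.98],
    "longitude": [-84.43, -87.90],
})

# Left join keeps every flight and attaches the origin coordinates by airport code.
merged = flights.merge(
    airports, how="left", left_on="origin_airport", right_on="iata_code"
).drop(columns="iata_code")

print(merged)
```

A left join is used here so that flights whose airport is missing from the reference table are kept (with empty coordinates) rather than silently dropped.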
Attribute                                  Values
Airlines Considered (carrier codes)        UA, AA, US, F9, B6, OO, AS, NK, WN, DL, EV, HA, MQ, VX
Flight On-time Performance data (Input)    Scheduled Departure, Departure Delay, Scheduled Time,
                                           Elapsed Time, Air Time, Distance, Scheduled Arrival,
                                           Arrival Delay, Previous Arrival Delay,
                                           Previous Departure Delay
Selected Features                          Scheduled Departure, Departure Delay, Scheduled Time,
                                           Elapsed Time, Air Time, Distance, Scheduled Arrival,
                                           Arrival Delay, Previous Arrival Delay,
                                           Previous Departure Delay, Month, Day of Month,
                                           Airline Name, Origin Latitude, Origin Longitude,
                                           Destination Latitude, Destination Longitude
Classification (Output)                    1 - indicates occurrence of delay
                                           0 - indicates absence of delay
Regression (Output)                        Numerical value (score) of the flight delay prediction

Table 1. Feature Study

Fig 1: Summary of the model developed

The proposed system consists of 2 phases:

Phase A: Data and Pre-processing
Phase B: Predictive Model

A) Data and Pre-Processing

i. Data Collection

To train and test the models, we used a publicly available dataset for domestic air traffic from Kaggle. The original source of the dataset is the online Bureau of Transportation Statistics (BTS) database. Datasets of various airports, airlines and flights were merged with the help of joining keys, and the resulting dataset was the final dataset to which the models were applied. The dataset covers the years 2004-2019 and consists of well over 3 million examples, with features categorized as follows:

1. Information about the flight (day, month, year, airline, flight number, tail number)
2. Information about origin and destination (origin airport, destination airport)
3. Information about the departure (scheduled departure, departure time, departure delay, taxi)
4. Information about the flight journey (air time, distance, hour, minute, time_hour)
5. Information about the arrival (scheduled arrival, arrival time, arrival delay)

ii. Data Pre-Processing

Air traffic data for five major airports and the corresponding weather data are extracted. Following the rules of BTS, flights that arrive at the gate within 15 minutes of the scheduled time are considered on-time. Canceled and diverted flights in the training set are deemed delayed. The data contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets are incompatible with scikit-learn estimators, which assume that all values in an array are numerical and that all have and hold meaning. A basic strategy for using incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. Here we use the Imputer class of the sklearn module; the methods used are transform and fit_transform.

The following data fields were extracted from the final dataset for every scheduled flight because they are factors that have an impact on flight delays.

• Year
• Month
• Day of Month
• Day of Week
• Departure and Arrival Schedule in Local Time
• Arrival Delay Indicator: 0 if actual arrival time minus scheduled arrival time is less than 15 minutes, 1 if actual arrival time minus scheduled arrival time is greater than or equal to 15 minutes

B) Predictive Model

The model consists of 2 stages:

a. Departure Delay Prediction
b. Arrival Delay Prediction

a. Departure Delay Prediction

The dataset was split into training data and test data in the ratio 90:10 for the purpose of evaluating the model. The results of the model are shown in Table 3. The Support Vector Machine Regression model performed better than the other models, with an R-squared score of 0.17.

b. Arrival Delay Prediction

As in the departure delay prediction phase, the dataset was split into training data and test data in the ratio 90:10. The results of this stage of the model are shown in Table 4. As seen from the table, the Support Vector Machine Regression model performs the best with an R2 score of 0.26, followed by the Decision Tree Regressor with an R2 score of 0.26.

According to the standards set by the Airports Authority of India (AAI), a flight is said to have a class value of 0 if it departs or arrives no later than 15 minutes from the scheduled time; otherwise it is said to have a class value of 1 and is considered delayed. If the classification stage outputs 0, there is no delay. If the classification stage outputs 1, the regression stage predicts the value of the delay in terms of a predictive score.

The following algorithms showed the highest accuracy among those implemented:
i) Support Vector Machine
ii) Decision Tree - Regression
iii) Stacking Algorithm (hybrid algorithm consisting of Random Forest Regressor, Decision Tree Regressor, Logistic Regression and SVM Algorithm)

i) Support Vector Machine (SVM)

The idea of the SVM classification algorithm is to use a nonlinear transform ϕ(x) to map the input data to a higher-dimensional space and then to perform linear classification of the data in this high-dimensional feature space, so that the optimal hyperplane is constructed.
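The pre-processing and SVM regression steps of this section can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the exact code behind the reported results: the data are synthetic, the hyper-parameters (`C`, `epsilon`, the RBF kernel) are arbitrary choices, and `SimpleImputer` from `sklearn.impute` is used because the `Imputer` class was removed from later scikit-learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the flight features (not the paper's data):
# the columns play the role of distance, scheduled time and previous arrival delay.
X = rng.normal(size=(500, 3))
y = 10.0 * X[:, 2] + rng.normal(scale=5.0, size=500)   # delay in minutes

# Arrival Delay Indicator: 1 if the delay is 15 minutes or more, else 0.
is_delayed = (y >= 15).astype(int)

# Blank out roughly 5% of the feature values to mimic missing data.
X[rng.random(X.shape) < 0.05] = np.nan

# 90:10 split between training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0
)

# Impute with the column mean, normalize, then fit an RBF-kernel SVR.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    SVR(kernel="rbf", C=10.0, epsilon=0.1),
)
model.fit(X_train, y_train)

predicted_delay = model.predict(X_test)
print("test R2:", round(model.score(X_test, y_test), 3))
```

Wrapping the imputer and scaler in one pipeline means their statistics are learned from the training split only and then reused unchanged on the test split.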
ii) Decision Tree - Regression

A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing values for the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

iii) Stacking Algorithm

The idea of the stacking algorithm is to tackle a problem by breaking it into multiple sub-problems, with each sub-problem handled by a different algorithm. Every algorithm has its own set of strengths and weaknesses. The intuition behind stacking is to combine the strengths of various algorithms in order to get a model that reduces the error rate and boosts the overall accuracy of the system. Stacking is considered a hybrid algorithm because it consists of multiple algorithms: a set of base learners and a final learner. The results obtained by the base learners are treated as input for the final-level regressor. The job of this final-level regressor is to overcome the errors made by the base learners and to boost the overall accuracy. Here, the stacking algorithm used is a hybrid algorithm using Random Forest Regressor, Decision Tree Regressor, Logistic Regression and SVM.

In this paper we have used a combination of 3 generic algorithms augmenting one another to solve problems they are not designed to solve. Since most machine learning algorithms are designed for a particular dataset or task, combining multiple ML algorithms can greatly improve the overall result by helping them tune one another, generalize, or adapt to unknown tasks.

VI. RESULTS

A. Data Visualization

Fig 2: Top 10 flights of US domestic air traffic

In Fig 2, we represent the top 10 flights of United States domestic air traffic. The dataset consists of 3 million examples. Through data visualization we found the number of flights that arrive on time or are delayed for the United States.

B. Comparative Analysis

SVM, Decision Tree Regression and Stacking models are applied to a dataset using the Python programming language to classify whether a flight is delayed, based on characteristics such as Origin, Destination, Scheduled Departure, Departure Delay, Scheduled Time, Arrival Delay, Previous Arrival Delay, Previous Departure Delay, Month, Day of Month and Airline Name.

• R2 Score

R-Squared is a statistical measure of fit that indicates how much variation of a dependent variable is explained by the independent variable(s) in a regression model.

Its scale is intuitive: it typically ranges from 0 to 1. Zero indicates that the proposed model does not improve prediction over the mean model, and one indicates perfect prediction. The better the model, the higher the R-squared.
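For reference, the standard textbook definition of R-squared is given below. The symbols are notation introduced here rather than taken from the text: y_i is the actual delay of flight i, \hat{y}_i the predicted delay, \bar{y} the mean of the actual delays and n the number of flights.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

With this definition, a model that always predicts the mean scores 0, and on held-out data the value can fall below 0 when the model predicts worse than the mean.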
• MSE Score

MSE is the average of the squared error and is used as the loss function for least squares regression. It is the sum, over all the data points, of the square of the difference between the predicted and actual target variables, divided by the number of data points.
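Both scores can be computed directly from their definitions. The sketch below uses plain Python; the function names `mse` and `r_squared` and the four sample delays are hypothetical and chosen only for illustration.

```python
def mse(actual, predicted):
    """Average of the squared differences between predicted and actual values."""
    n = len(actual)
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n


def r_squared(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical delays in minutes, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(mse(actual, predicted))        # 6.5
print(r_squared(actual, predicted))  # approximately 0.948
```

Here the squared errors are 4, 4, 9 and 9, so the MSE is 26 / 4 = 6.5, and the total sum of squares around the mean of 25 is 500, giving an R2 of 1 - 26/500.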

Fig 3: R2 score for SVM model – Departure Delay

Fig 4: R2 score for SVM model – Arrival Delay

Fig 3 and Fig 4 show the results of the SVM model; the R2 scores achieved for departure delay and arrival delay are 0.18 and 0.25, respectively.

Fig 5: Delayed flights representation

Fig 5 represents the analysis of the actual vs predicted number of flights that are delayed for the United States. The blue bar shows the actual delay of the flights and the green bar shows the predicted flight delay on the basis of the R2 score for the SVM model.

Fig 6: Comparative analysis of training and testing data of R2 score

In Fig 6, we have represented the R2 score for both the training and testing datasets using the SVM model. The blue dots represent training results and the green dots represent testing results.
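A comparison of this kind (training and testing R2 and MSE for each regressor) could be produced along the following lines. This is a hedged sketch rather than the code behind the reported scores: the data are synthetic, all hyper-parameters are library defaults or arbitrary, and because scikit-learn's `StackingRegressor` expects regressors, `LinearRegression` is used as the final learner in place of Logistic Regression, which is a classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic regression data standing in for the flight features (assumption).
X = rng.normal(size=(400, 4))
y = 8.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=4.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

models = {
    "SVM": SVR(kernel="rbf"),
    "DECISION TREE": DecisionTreeRegressor(random_state=42),
    "STACKING": StackingRegressor(
        estimators=[
            ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
            ("dt", DecisionTreeRegressor(random_state=42)),
            ("svr", SVR(kernel="rbf")),
        ],
        final_estimator=LinearRegression(),  # assumption: stands in for Logistic Regression
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    results[name] = {
        "train_r2": r2_score(y_train, train_pred),
        "train_mse": mean_squared_error(y_train, train_pred),
        "test_r2": r2_score(y_test, test_pred),
        "test_mse": mean_squared_error(y_test, test_pred),
    }

for name, scores in results.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

An unpruned decision tree can memorise continuous training targets, so a training R2 of 1.0 with a training MSE of 0.0 is expected from it; the testing columns are the ones that indicate how well a model generalizes.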
DEPARTURE DELAY
                  TRAINING              TESTING
REGRESSOR         R2 score    MSE       R2 score    MSE
SVM               0.18        31.63     0.18        32.00
DECISION TREE     1.0         0.0       0.32        28.98
STACKING          0.86        12.64     0.05        34.19

Table 3: Departure Delay scores

ARRIVAL DELAY
                  TRAINING              TESTING
REGRESSOR         R2 score    MSE       R2 score    MSE
SVM               0.33        32.17     0.34        13.99
DECISION TREE     1.0         0.0       0.09        36.27
STACKING          0.88        12.7      0.20        33.87

Table 4: Arrival Delay scores

Table 3 and Table 4 summarize the R2 score and MSE obtained with the three algorithms (SVM, Decision Tree and Stacking) for departure and arrival delays.

VII. FUTURE SCOPE AND CONCLUSION

Given a set of attributes (feature values), the SVM model predicts the arrival delay and departure delay of an aircraft travelling from a specific origin to a destination, that is, whether it will arrive on time or be delayed, with an R2 score of 0.18 (departure delay) and 0.33 (arrival delay). An R2 score closer to 1 would indicate a more efficient model; the scores obtained here provide a baseline for determining the delay of any given flight from its parameters alone.

The future scope of this work involves the application of more advanced and novel pre-processing techniques and Machine Learning-Deep Learning hybrid models tuned with Grid Search to achieve higher model performance.

VIII. ACKNOWLEDGEMENT

This paper and the research behind it would not have been possible without the exceptional support of our project guide and supervisor, Mr. Abhay Patil. His enthusiasm, knowledge and exacting attention to detail have been an inspiration and kept the work on track from the beginning. We would also like to thank our friends and family, who supported us and offered deep insight into the study.

IX. REFERENCES

[1] Chakrabarty, Navoneel, et al. "Flight Arrival Delay Prediction Using Gradient Boosting Classifier." Emerging Technologies in Data Mining and Information Security. Springer, Singapore, 2019.
[2] Suvojit Manna, Sanket Biswas, Riyanka Kundu, Somnath Rakshit, Priti Gupta, Subhas Burman "A statistical approach to predict flight delay using gradient boosted decision tree", International Conference on Computational Intelligence in Data Science (ICCIDS), 2017.
[3] Juan Jose Rebollo and Hamsa Balakrishnan "Characterization and Prediction of Air Traffic Delays".
[4] Sruti Oza, Somya Sharma, Hetal Sangoi, Rutuja Raut, V.C. Kotak "Flight Delay Prediction System Using Weighted Multiple Linear Regression", International Journal of Engineering and Computer Science, ISSN: 2319-7242, Volume 4 Issue 4 April 2015.
[5] Anish M. Kalliguddi and Aera K. Leboulluec "Predictive Modeling of Aircraft Flight Delay", Universal Journal of Management 5(10): 485-491, 2017, DOI: 10.13189/ujm.2017.051003
[6] Brett Naul "Airline Departure Delay Prediction".
[7] Young Jin Kim, Sun Choi, Simon Briceno, Dimitri Mavris "A deep learning approach to flight delay prediction", 35th Digital Avionics Systems Conference (DASC), 2016.
[8] Sina Khanmohammadi, Salih Tutun, Yunus Kucuk "A New Multilevel Input Layer Artificial Neural Network for Predicting Flight Delays at JFK Airport", doi.org/10.1016/j.procs.2016.09.321
[9] Loris Belcastro, Fabrizio Marozzo, Domenico Talia and Paolo Trunfio "Using Scalable Data Mining for Predicting Flight Delays".
[10] https://en.wikipedia.org/wiki/Flight_cancellation_and_delay