0% found this document useful (0 votes)
35 views25 pages

Report

Uploaded by

jiyaa8407
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
35 views25 pages

Report

Uploaded by

jiyaa8407
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

D37

1
INDEX

1. Introduction
2. Objectives
3. Data
3.1 Data Scraping
3.2 Data Description
4. Model Approach
5. Exploratory Data Analysis
5.1 Network Graph
5.2 Radar Plot
5.3 Distribution Analysis
5.4 Feature Relationships
5.5 Weather Contributions
6. Time Series Analysis
6.1 Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)
6.2 FFT Analysis
7. Feature Engineering
7.1 Flight Record Encoding
8. Model Training
8.1 Using Time Series Models
8.2 Using ML Models
8.3 Using Flight Record Encoder
8.4 Using Neural Networks
9. Explainability
10. Flight Rescheduling
11. Results
12. Conclusion
Annexure

2
INTRODUCTION

Airlines face a complex challenge in maximizing profitability while managing multiple


operational areas, such as flight scheduling, ground crew efficiency, fuel management, and
passenger services. Given strict regulations and fluctuating fuel costs, predicting these delays
and optimizing these processes is essential for minimizing delays, improving turnaround
times, and boosting overall performance. Airlines must carefully plan maintenance schedules
to minimize downtime without compromising safety. Balancing preventive maintenance with
revenue-generating flights is delicate.

OBJECTIVES

1. Predict Flight Delays: Develop predictive models to forecast potential delays in


flight schedules, enabling proactive resource adjustments to minimize disruptions.
2. Optimize Flight Rescheduling: Design a rescheduling model that minimizes
cumulative delays by considering constraints like aircraft availability, crew schedules,
and regulatory requirements.
3. Enhance Operational Efficiency: Use data-driven techniques to improve overall
turnaround times and reduce fuel consumption without compromising safety or
service quality

DATA

3.1 Data Scraping Process

● Monthly data on flight operations, delays, and airline ratings of the different airlines were
extracted in CSV file format from the Bureau of Transportation Statistics (BTS) website:
https://www.transtats.bts.gov.

● For extra weather data, Python packages such as Geopy, OpenCage, and Meteostat were
utilized. These tools were used to gather longitude and latitudes of the airports, to further
extract location-specific weather details, including average temperature, atmospheric
pressure, wind speed, and precipitation, from reliable meteorological sources.

3.2 Data Description

Dataset Overview:

3
● We have used the dataset corresponding to flight schedule and delay delays for the
United States of America.
● Number of records (rows):We have a total of 5,48,685 rows,which signify the
monthly data over a period from 2022 to 2024.

Data Variables:

● List of variables(column) in the dataset: [‘Year’, ’FlightDate’,


‘DOT_ID_Reporting_Airline’, ’Tail_Number’, ’Flight_Number_Reporting_Airline’,
‘OriginAirportID’, ’OriginCityName’, ’OriginWac’, ’DestAirportID’,
’DestCityName’, ‘DestWac’, ’CRSDepTime’, ’DepTime’, ’DepDelayMinutes’,
‘DepartureDelayGroups’, ’TaxiOut’, ’WheelsOff’, ’WheelsOn‘, ’TaxiIn’,
’CRSArrTime’, ‘ArrTime’, ’ArrDelayMinutes’, ’Diverted’, ’CRSElapsedTime’,
’ActualElapsedTime’, ‘AirTime’, ’FlightsDistance’, ’CarrierDelay’, ‘WeatherDelay‘,
’NASDelay’, ‘SecurityDelay’, ’LateAircraftDelay’, ‘DivAirportLandings’,
‘CancellationCode_encoded’, ‘index’, ’tavg’, ‘tmin’, ’tmax’, ‘prcp’, ‘snow’, ’wdir’,
‘wspd’, ’wpgt’, ‘pres’, ’tsun’, ‘airline_ratings’].

MODEL APPROACH

4
EXPLORATORY DATA ANALYSIS

5.1 Network Graph

Network Graph of Delayed Flights

Description: The network graph shows how the


airports (yellow coloured nodes) are connected via
flight routes. It has been shown on the map of the USA
for clearer understanding.

Observations:

1. The colors of the edges between two nodes depict the average delay along those routes.

a.The purple-colored edges denote a delay between 0-50 minutes.

b.The pink edges show an average delay of 50-100 minutes.

c.The yellow edge denotes a delay of more than 200 minutes.

5
5.2 Radar Plot

Description: A radar plot (or spider plot), showcasing the magnitudes by which different
features affect the value of Departure Delay has been shown. Observations:

1.For each flight, the factors affecting departure delay differ. So, K-Means Clustering is used
to cluster the data into different clusters which have different factors affecting the departure
delays with different weights.

2. An optimum value of 7 clusters was chosen using the elbow method.

3. In Cluster 0, as expected, arrival delay, arrival time, and departure time are significant in
predicting departure delays.

4. A unique observation is that the Destination Airports, Reporting Airlines and Aircrafts are
of higher importance. Similarly, Clusters 1 to 6 are given in the Annexure.

6
5.3 Distribution Analysis

Description: The distribution of departure delay has been shown. Observations:

1. More than 2,50,000 flights faced a delay between 0 to 100 minutes.


2. A very small percentage of flights got delayed by more than 100 minutes.

5.4 Feature Relationships

7
Description: This plot visualizes the relationships between ArrDelayMinutes (arrival delay),
‘tavg’ (average temperature), ‘prcp’ (precipitation), snow (snowfall), and ‘wspd’ (wind
speed).

Observations:

1. Each subplot shows the scatter plot for every pair of variables, highlighting potential
correlations or trends between flight delays and weather factors.
2. The diagonal plots display kernel density estimates (KDE) for each individual
variable, indicating their distributions.

5.5 Weather Contributions

Description: We can see


that the departure delays
are less when the
atmospheric pressure is
within the normal range
(1000 HectoPascals to
1020 HectoPascals), and
slowly increases as the
pressure increases above
the normal range or
decreases below it.

Description: The
departure delays are
less when the
temperature ranges
between 0℃ to 28℃.
The delay in flights
increases as the
temperature goes
beyond normal ranges.

8
TIME SERIES ANALYSIS

6.1 Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF):

The ACF and PACF graphs for scheduled departure time are shown. They show no correlation of
delay with the input features. Hence, there is no seasonality trend followed. The plots for other
features are given in the Annexure.

6.2 Fast Fourier Transforms Analysis

Fast Fourier Transforms are applied on the dataset to find any trends in the frequency
domain. As can be seen above, there are no significant trends.

9
FEATURE ENGINEERING

7.1 Flight Record Encoding

The flight record encoder is used to encode the features of the target flight record Rj and the
features related with the delay of its pre-order flight Fk. The inputs of the flight record
encoder are: the flight record Rj = < Fj,Wj >, the arrival delay dkt of the pre-order flight Fk,
and the pre-order flight gap g(Fk → Fj) between the scheduled departure time tak of Fj and the
actual arrival time of flight Fk.

The features R , dkt and g(Fk → Fj) can be separated into discrete and continuous features. As
shown in the figure, the flight record encoder encodes the discrete features into
high-dimensional dense vectors and then concatenates the embeddings of discrete features
with the continuous features. The concatenation is fed into a fusion layer to obtain the
representation of the flight record.

10
Evaluating Prediction Accuracy with MAE Score

MAE SCORE: Mean Absolute Error (MAE) is to predict continuous numerical values. It
measures the average absolute difference between the predicted values and the actual target
values. Unlike other metrics, MAE doesn’t square the errors, which means it gives equal
weight to all errors, regardless of their direction. This helps to understand the magnitude of
errors without considering whether they are overestimations or underestimations.

MAE is particularly well suited for predicting delay in flight departures, due to its
proficiency in quantifying the magnitude or errors in the predictive model. It quantifies the
prediction inaccuracy by considering both the size and direction of errors. Its suitability for
flight delay prediction arises from its capability to explain the error in the units of delay,
making the performance of the model more understandable.

Train-Test Split: Our dataset contains flight schedule and delay details starting from
01-01-2022 to 31-08-2024. For the evaluation of our model, we have used the data
corresponding to 2022 and 2023 as the training set, and the data from 2024 as the testing set.
The testing set contains 309 new routes, 6 new airports and 149 new aircrafts that were not
present in the training set. Thus, our prediction model generalizes well and can perform well
on data containing new routes, airports, and aircrafts.

MODEL TRAINING

8.1 Using Time Series Models

Utilized Time Series Models to predict the departure delays, focusing on datasets with
frequency encoded features. Models such as ARIMA and SARIMAX, as expected, gave poor
performances, owing to the lack of seasonal trend, as discussed in Exploratory Data Analysis
(Section 4).

MODEL MAE

ARIMA 50.4102

SARIMAX 73.9801

11
8.2 Using ML Models

Step 1: Individual feature prediction (X):

● For each feature column, we apply machine learning algorithms separately.


● The outcome of this step is the predicted values of each feature, which becomes our
new set of feature variables (X_pred).

Step 2: Departure delay prediction

● With the predicted feature values (X_pred), we subsequently apply another predictive
model or technique to predict the departure delays (y) based on these estimated
feature values.

We can observe that Machine Learning models give very high values of Mean Absolute
Errors. This could be due to the lack of flexibility in the models, due to which complex
non-linear relationships in the data could not be captured. ML models rely upon feature
engineering to a large extent.

MODEL MAE

LINEAR 53.1465

DECISION TREE 73.9801

ELASTICNET 53.5216

EXTRATREES 56.0891

RIDGE 51.7051

LGBM 53.4285

RANDOM FOREST 49.3948

LASSO 51.3386

XGBOOST 50.6907

GRADIENT BOOSTING 48.4953

12
8.3 Using Flight Record Encoder

Step 1: Individual feature prediction (X):

● The categorical columns like Tail Number, Origin City and Destination City were
converted to embeddings, as discussed in Feature Engineering (Section 7)
● For each feature column, we apply neural networks separately, and the outcome of
this step is the predicted values of each feature, which becomes our new set of feature
variables (X_pred).

Step 2: Departure delay prediction

● Since Convolutional Neural Network gives the lowest MAE score among all the
models, it is used to predict the features (X_pred).
● With the predicted feature values (X_pred), we subsequently apply another neural
network model to predict the departure delays (y) based on these estimated feature
values.

MAE Scores of Neural Networks for Step 1

MODEL MAE

LSTM 9.2648

CNN 8.0155

DNN 9.8621

CNN-LSTM 11.9860

GRU 9.8151

RESNET50 12.0592

13
MAE Scores of Neural Networks for Step 2

MODEL MAE

LSTM 24.0982

CNN 25.5936

DNN 26.5173

CNN-LSTM 25.4449

GRU 23.1388

RESNET50 22.5774

DL models gave better MAE values. As compared to ML models, Deep neural networks
perform better on large datasets.

Employing Flight Record Encoding on categorical features further improves the MAE scores
and reduces it to X, due to its proficiency in capturing the categorical data and their
relationships with the delay in departure time.

8.3 Using Neural Networks

Several Neural Networks were trained and evaluated using the 2-step procedure, as described
while using Machine Learning Models (Section 8.2).

MODEL MAE

CNN 26.1384

LSTM 23.9869

DNN 21.1841

CNN-LSTM 21.1173

GRU 24.5893

14
MODEL MAE

RESNET50 21.4540

The CNN-LSTM model is ideal for this dataset because it captures both spatial and temporal
dependencies, which are critical for accurately predicting delays. The convolutional layers in
the CNN component excel at extracting patterns from the feature set, isolating important
predictors like weather conditions, airport factors, and operational delays. This enhances the
feature representation before passing it to the LSTM component. The LSTM, designed for
sequential data, is well-suited to model the temporal dependencies in flight delay patterns,
such as cascading delays and seasonal trends. By combining CNN for feature extraction and
LSTM for sequence learning, this hybrid model can capture complex interactions and
time-dependent relationships that improve delay prediction accuracy over simpler models.
Additionally, it can leverage variations in daily and seasonal patterns, which are prevalent in
flight delay data.

FLIGHT RESCHEDULING

9.3 Simple Genetic Algorithm

A Simple Genetic Algorithm (SGA) is an optimization method inspired by the principles of


natural selection and genetics. SGA has been employed for rescheduling the flight, as it is
effective for scheduling and delay minimization tasks. With numerous constraints such as
aircraft availability, crew scheduling, and regulatory requirements, SGA can efficiently
explore a large search space of possible schedules by iteratively refining solutions. SGA’s
adaptability and explorations allow the algorithm to generate feasible, optimized schedules
that consider both immediate and future scenarios.

1. The algorithm uses a fitness function, to calculate the total delay for a given schedule,
given the scheduled and rescheduled times.
2. A random sample of 20 possible flight schedules is created using Population
Initialization.
3. The tournament selection method, using 3 tournaments, is used to choose parents for
crossover, which selects the best individuals from a subset.

15
4. The single-point crossover generates a new child schedule by combining two parent
schedules. A simple swap mutation is applied with a 0.1 mutation rate.
5. The algorithm runs for 100 generations, continually evolving the population to
improve fitness.

The optimization problem was solved for the entire test dataset (1st January 2024 to 31st
August 2024), last month of the test dataset (August 2024), and last week of the test dataset
(25th August 2024 to 31st August 2024). The total amount of initial delay, delay after
optimization (in minutes) and percentage improvement has been shown in the table below.

DATASET SIZE INITIAL OPTIMIZED IMPROVEMENT


DELAY DELAY (%)
(mins) (mins)

1 week 838 58,166 106 99.82

1 month 4322 3,06,241 3,100 98.99

8 months 32224 22,78,199 28,350 98.76

RESULTS AND CONCLUSION:

16
In the final analysis of our models' performance, the Mean Absolute Error
(MAE) scores for different approaches have been compared. The following
table displays the MAE scores for the top performing models:

1. Network graphs depict how the airports (yellow coloured nodes) are
connected via flight routes. We infer from it that most of the delays
are of less than 50 mins.
2. Radar plots derived from clustering show the impact of various
features on delay time. This helps us understand which features are
essentially important in predicting the delay.
3. The CNN-LSTM model captures spatial and temporal dependencies,
making it ideal for accurately predicting delays in complex,
time-sequenced data.
4. The test data contains 309 new routes, 6 new airports and 149 new
aircrafts that were not present in the training data. An MAE of 5
mins on the testing data shows that the model generalizes well on
new routes, airports and aircrafts.
5. A Simple Genetic Algorithm has been used to reschedule flights to
minimize delay, achieving a 98.8% reduction in delay within the test
dataset.

17
ANNEXURE:

18
Effect of Categorical Features on Departure Delay

19
Flights and Flight Delay Distribution across the Year

20
Effect of Weather Parameters on Departure Delay

21
Box Plots and Scatter Plots Suggesting Coherency between Departure and Arrival Delays

22
Effect of Categorical Features on Arrival Delay

23
Elbow Method To Find K and perform K-Means Clustering for Radar Plot

24
Radar Plots for Clusters 1 to 6

25

You might also like