Main Summary

2023/9/15 16:36 main summary
Introduction
This project uses the 2009 ASA Statistical Computing and Graphics Data Expo dataset.
Because the full dataset is very large (it covers the years 1987 to 2008),
I use only the 2006 and 2007 data as a sub-dataset for the tasks.
This project focuses on five questions:
1. When is the best time of day, day of the week, and time of year to fly to minimise delays?
2. Do older planes suffer more delays?
3. How does the number of people flying between different locations change over time?
4. Can you detect cascading failures as delays in one airport create delays in others?
5. Use the available variables to construct a model that predicts delays.
Library introduction
pandas: load and process the CSV data
numpy: array and matrix operations
matplotlib and seaborn: data visualization
scipy: Pearson correlation test
sklearn: train and evaluate the prediction model
Data wrangling
In [80]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
Load flight, plane, airport data
In [11]:
flight_2007 = pd.read_csv('./dataverse_files/2007.csv')
flight_2006 = pd.read_csv('./dataverse_files/2006.csv')
flight = pd.concat([flight_2006, flight_2007])
In [12]:
airport = pd.read_csv('./dataverse_files/airports.csv')
In [13]:
plane = pd.read_csv('./dataverse_files/plane-data.csv')
flight data
localhost:8890/notebooks/Desktop/year2/ST219570/ST2195 coursework final/python jupyter/main summary.ipynb# 1/19

In [14]:
print(flight.shape)
print(flight.columns)
(14595137, 29)
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',
'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',
'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
dtype='object')
plane data
In [15]:
print(plane.shape)
print(plane.columns)
(5029, 9)
Index(['tailnum', 'type', 'manufacturer', 'issue_date', 'model', 'status',
'aircraft_type', 'engine_type', 'year'],
dtype='object')
airport data
In [16]:
print(airport.shape)
print(airport.columns)
(3376, 7)
Index(['iata', 'airport', 'city', 'state', 'country', 'lat', 'long'], dtype='object')
Question 1
When is the best time of day, day of the week, and time of year to fly to minimise delays?
Join the flight data with the plane data on tailnum.

Get the hour of the day by integer-dividing CRSDepTime by 100, because the original format is HHMM.
Group the delays by day of month, day of week, and hour of day, then calculate the sum, mean, and
median of the departure delay for each group.
Plot the delays and find the best time.
Use line plots and box plots to display the median and mean delay at different times.
We mainly focus on the median delay at different times.
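The steps above can be sketched on a tiny hand-made table. The rows are made-up values, not taken from the dataset; only the column names `CRSDepTime` and `DepDelay` come from the flight data, and the aggregate names `total`, `mean`, and `median` are chosen here for illustration.

```python
import pandas as pd

# Toy data (made-up values); column names follow the flight table.
toy = pd.DataFrame({
    "CRSDepTime": [530, 545, 1705, 1730, 1759],  # scheduled departure, HHMM
    "DepDelay": [-2, 4, 10, 30, 50],             # departure delay in minutes
})

# Integer division by 100 drops the minutes and keeps the hour.
toy["CRSDepHourOfDay"] = toy["CRSDepTime"] // 100

# Named aggregation gives flat column names instead of a MultiIndex.
hod_delay = (
    toy.groupby("CRSDepHourOfDay")["DepDelay"]
    .agg(total="sum", mean="mean", median="median")
    .reset_index()
)

# Hour with the lowest median delay in the toy data.
best_hour = int(hod_delay.loc[hod_delay["median"].idxmin(), "CRSDepHourOfDay"])
print(hod_delay)
print("best hour:", best_hour)
```

Flat column names make it possible to refer to `hod_delay["median"]` directly, rather than to a position in a NumPy array.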

In [141]:
flight['CRSDepHourOfDay'] = flight['CRSDepTime'] // 100

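# DepDelay is recorded in minutes, so dividing by 60 converts it to hours (despite the name DelayMin)
flight['DelayMin'] = flight['DepDelay'] / 60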
In [135]:
dom_delay = flight.groupby('DayofMonth').agg({'DepDelay': ['sum', 'mean', 'median']}).reset_index()
dow_delay = flight.groupby('DayOfWeek').agg({'DepDelay': ['sum', 'mean', 'median']}).reset_index()
hod_delay = flight.groupby('CRSDepHourOfDay').agg({'DepDelay': ['sum', 'mean', 'median']}).reset_index()
In [136]:
dom_delay_np = dom_delay.to_numpy()
dow_delay_np = dow_delay.to_numpy()
hod_delay_np = hod_delay.to_numpy()

In [26]:
# Array columns after to_numpy(): 0 = group key, 1 = sum, 2 = mean, 3 = median
# Panel 1: delay by day of month (median on the left axis, mean on the right axis)
ax1 = plt.subplot(3, 1, 1)
l1, = ax1.plot(dom_delay_np[:,0], dom_delay_np[:,3], label='median')
ax11 = ax1.twinx()
l11, = ax11.plot(dom_delay_np[:,0], dom_delay_np[:,2], color='orange', label='mean')
plt.legend([l1, l11], ['median', 'mean'])

# Panel 2: delay by day of week
ax2 = plt.subplot(3, 1, 2)
l2, = ax2.plot(dow_delay_np[:,0], dow_delay_np[:,3], label='median')
ax21 = ax2.twinx()
l21, = ax21.plot(dow_delay_np[:,0], dow_delay_np[:,2], color='orange', label='mean')
plt.legend([l2, l21], ['median', 'mean'])

# Panel 3: delay by scheduled departure hour
ax3 = plt.subplot(3, 1, 3)
l3, = ax3.plot(hod_delay_np[:,0], hod_delay_np[:,3], label='median')
ax31 = ax3.twinx()
l31, = ax31.plot(hod_delay_np[:,0], hod_delay_np[:,2], color='orange', label='mean')
plt.legend([l3, l31], ['median', 'mean'])
Out[26]:
<matplotlib.legend.Legend at 0x17fa320d0>

In [153]:
plt.clf()
ax = sns.boxplot(x='DayofMonth', y='DelayMin', data=flight, showfliers=False)
plt.show()

In [152]:
plt.clf()
ax = sns.boxplot(x='CRSDepHourOfDay', y='DelayMin', data=flight, showfliers=False)
plt.show()

In [156]:
plt.clf()
ax = sns.boxplot(x='DayOfWeek', y='DelayMin', data=flight, showfliers=False)
plt.show()
As shown in the figures, we mainly focus on the median departure delay grouped by time.
The best time of day is 5 o'clock, because it has the smallest maximum delay and a lower average
delay.
The best day of the week is day 2 (Tuesday; the dataset codes Monday as 1), because it has a lower median and average delay.
The best days of the month are 6, 8, and 9. All of them have the smallest maximum delay and lower average and
median delays, but the box plot cannot show the difference between them.
Question2
Do older planes suffer more delays?
Import the plane data and drop NA values

Join the plane data and the flight data on 'tailnum' column
Group the delays by the plane's 'year' column and calculate the sum, mean, and
median of the departure delay.
Plot the delay against the year with a scatter plot and check whether older planes suffer more
delays.

In [27]:
plane = pd.read_csv('./dataverse_files/plane-data.csv')
plane = plane.dropna()
flight_plane = flight.join(plane.set_index('tailnum'), on='TailNum')
In [28]:
plane_delay = flight_plane.groupby('year').agg({'DepDelay': ['sum', 'mean', 'median']})

plane_delay_np = plane_delay.reset_index().to_numpy()
In [171]:
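# Create fresh axes for this figure; column 1 is the summed delay, column 2 the mean delay
ax1 = plt.subplot(1, 1, 1)
l1 = ax1.scatter(plane_delay_np[:,0], plane_delay_np[:,1], label='sum')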
ax1.tick_params(axis='x', rotation=90)
ax11 = ax1.twinx()
l11 = ax11.scatter(plane_delay_np[:,0], plane_delay_np[:,2], color='orange', label='mean')
plt.legend([l1, l11], ['sum', 'mean'])
Out[171]:
<matplotlib.legend.Legend at 0x109dfbf10>

In [194]:
from scipy.stats import pearsonr

# Drop groups whose manufacturing year is missing ('None') or a '0000' placeholder
plane_delay_np_noNone = plane_delay_np[(plane_delay_np[:, 0] != 'None'), :]
plane_delay_np_noNone = plane_delay_np_noNone[(plane_delay_np_noNone[:, 0] != '0000'), :]
plane_delay_np_noNone[:, 0] = plane_delay_np_noNone[:, 0].astype(np.int32)
# Column 0 is the manufacturing year, column 2 is the mean DepDelay
r, p = pearsonr(plane_delay_np_noNone[:,0], plane_delay_np_noNone[:,2])
print(f'Pearson correlation coefficient: {r}, Two-tailed p-value: {p}')
Pearson correlation coefficient: 0.1554534236860175, Two-tailed p-value: 0.2914008009962324
As shown in the figure, we mainly focus on the mean departure delay grouped by the plane's manufacturing year.
There is no clear relationship between manufacturing year and mean departure delay, so older planes
do not appear to suffer more delays.
The Pearson correlation coefficient is 0.155 with a two-tailed p-value of 0.291, so there is no statistically
significant linear correlation between manufacturing year and delay.
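To make the correlation coefficient concrete, here is a minimal sketch that computes Pearson's r by hand with NumPy. The arrays `years` and `mean_delay` and the helper `pearson_r` are illustrative only; the numbers are made up and are not the notebook's values.

```python
import numpy as np

# Made-up manufacturing years and mean delays, for illustration only.
years = np.array([1990.0, 1995.0, 2000.0, 2005.0])
mean_delay = np.array([12.0, 9.0, 11.0, 10.0])


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


r = pearson_r(years, mean_delay)
print(f"r = {r:.3f}")
```

With year on the x-axis, a negative r would mean that newer planes tend to have lower delays, and a positive r the opposite. A value near zero, as in the notebook, points to no linear trend in either direction.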
Question3
How does the number of people flying between different locations change over time?
In this question, I mainly use two approaches to analyze the problem. The dataset has no passenger
counts, so the number of flights is used as a proxy for the number of people flying.

1. Find the top-5 location pairs by number of flights and track how the number changes
by month (line plot).
2. Show the change in the number of flights with a heat map based on the geolocation of the airport.
Because there are too many airports to plot individually, grouping them into latitude and longitude regions works better (heat map).
For approach 1
First combine the origin and destination in a fixed order, for example, treat (BOS, LGA) and
(LGA, BOS) as the same key.
Group the data by the origin-destination pair and sort by the count.
Get the top-5 origin-destination pairs.
Plot the monthly change for those pairs and look for a pattern.
For approach 2
First join the flight data and the airport data on the origin airport and divide the data into latitude and
longitude slots 5 degrees wide according to the airport's location.
Count the number of flights in each slot and plot the counts in a heat map.
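Approach 1 can be sketched on a few made-up rows. This version builds the direction-independent key with `np.sort` instead of a row-wise `apply`, which is a swapped-in technique that tends to be faster on millions of rows. The toy rows, `N = 1`, and the `flights` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made-up rows); column names follow the flight table.
toy = pd.DataFrame({
    "Origin": ["BOS", "LGA", "HNL", "OGG", "BOS"],
    "Dest":   ["LGA", "BOS", "OGG", "HNL", "LGA"],
    "Month":  [1, 1, 1, 2, 2],
})

# Sort each (Origin, Dest) pair alphabetically so both directions share one key.
pairs = np.sort(toy[["Origin", "Dest"]].to_numpy(dtype=str), axis=1)
toy["SortedOriginDest"] = [f"{a},{b}" for a, b in pairs]

# Count flights per route and keep the N busiest routes.
N = 1
top_routes = toy["SortedOriginDest"].value_counts().head(N).index.tolist()

# Monthly flight counts for the selected routes.
monthly = (
    toy[toy["SortedOriginDest"].isin(top_routes)]
    .groupby(["SortedOriginDest", "Month"])
    .size()
    .reset_index(name="flights")
)
print(monthly)
```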
In [30]:
flight['SortedOriginDest'] = flight.apply(lambda row : ','.join(sorted([row['Origin'], row['Dest']])), axis=1)

N = 5
max_n = flight.groupby(['SortedOriginDest'], as_index=False).size().sort_values(by='size', asce
max_n_OD = list(max_n['SortedOriginDest'])

In [31]:
ODYear = flight[flight['SortedOriginDest'].isin(max_n_OD)]
ODYear = ODYear[ODYear['Year'].isin([2006, 2007])]
ODYear = ODYear.groupby(['SortedOriginDest', 'Month'], as_index=False)
ODYear_cnt = ODYear['SortedOriginDest'].size()
In [32]:
plot_dic = {}
for od in max_n_OD:
plot_dic[od] = ODYear_cnt[ODYear_cnt['SortedOriginDest'] == od]
plt.plot(plot_dic[od]['Month'], plot_dic[od]['size'], label=od)
plt.legend()
Out[32]:
<matplotlib.legend.Legend at 0x17fc5d4f0>
In this approach, I have chosen the top-5 routes, grouping both directions between the origin and destination
city.
For example, flights in direction (HNL, OGG) and (OGG, HNL) are counted as the same 'location'.
As shown in the figure, for the routes with HNL as origin or destination, February has the fewest people flying and
July has the most.
All routes show a decrease in February and an increase in March.
More people fly in summer than in winter.

In [95]:
flight['my'] = flight['Year'].astype(str) + '-' + flight['Month'].astype(str).str.zfill(2)

flight_airport = pd.merge(flight[['my', 'Origin']], airport[['iata', 'lat', 'long']], left_on='Origin', right_on='iata')
slot = 5
flight_airport['round_lat'] = flight_airport['lat'] // slot * slot
flight_airport['round_long'] = flight_airport['long'] // slot * slot

lat_range = np.arange(flight_airport['lat'].min() // slot * slot, flight_airport['lat'].max() /
long_range = np.arange(flight_airport['long'].min() // slot * slot, flight_airport['long'].max(

dates = flight_airport['my'].unique()
dates = sorted(dates)

lats = np.zeros((len(lat_range), len(dates)))
longs = np.zeros((len(long_range), len(dates)))

flight_airport = flight_airport[['my', 'round_lat', 'round_long', 'Origin']]

lats_g = flight_airport.groupby(['my', 'round_lat']).count().reset_index()
longs_g = flight_airport.groupby(['my', 'round_long']).count().reset_index()

In [96]:

for _, row in lats_g.iterrows():
lats[(int)(row['round_lat'] // slot - flight_airport['round_lat'].min() // slot), dates.ind
for _, row in longs_g.iterrows():
longs[(int)(row['round_long'] // slot - flight_airport['round_long'].min() // slot), dates.

In [98]:
ax = sns.heatmap(lats, linewidths=0.5, cmap='coolwarm')

ax.set_xlabel('yyyy-MM')
ax.set_xticks(ticks = range(len(dates)), labels=dates)
ax.set_ylabel('latitude')
ax.set_yticks(ticks = range(len(lat_range)), labels=lat_range)
plt.show()
(12, 24)

In [99]:
ax = sns.heatmap(longs, linewidths=0.5, cmap='coolwarm')

ax.set_xlabel('yyyy-MM')
ax.set_xticks(ticks = range(len(dates)), labels=dates)
ax.set_ylabel('longitude')
ax.set_yticks(ticks = range(len(long_range)), labels=long_range)
plt.show()
In this approach, as shown in the figures, the number of passengers by flight origin does not change much over time.
The heat map's color shows how the number of passengers changes with respect to latitude and longitude.
The color of each slot does not change much over time.
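The 5-degree binning behind the heat maps can be sketched as follows. The rows are made up for illustration, and the sketch uses `groupby(...).size().unstack()` to build the slot-by-month count matrix, which is an alternative to filling a NumPy array in a loop rather than the notebook's exact code.

```python
import pandas as pd

# Toy data (made-up rows): one row per flight, with month label and origin latitude.
toy = pd.DataFrame({
    "my":  ["2006-01", "2006-01", "2006-01", "2006-02", "2006-02"],
    "lat": [42.36, 40.78, 21.32, 42.36, 21.32],
})

slot = 5
# Floor each latitude to the lower edge of its 5-degree slot.
toy["round_lat"] = toy["lat"] // slot * slot

# Rows are latitude slots, columns are months, cells are flight counts.
lat_counts = toy.groupby(["round_lat", "my"]).size().unstack(fill_value=0)
print(lat_counts)
```

Floor division rounds toward negative infinity, so a longitude of -71.0 falls in the slot that starts at -75.0, not -70.0. The resulting table can be passed directly to `sns.heatmap`.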
Question4
Can you detect cascading failures as delays in one airport create delays in others?
Group the flight data by date and destination.

Aggregate the mean and median of the departure delay.
Join the flights on destination and origin within the same day.
For example, a flight from LA to DC is joined with another record from DC to another place on
the same day.
Because the data is too large, I only use the data from January 2006.
Calculate the ratio between the departure delays of the destination group and the origin group. If the ratio is steady and close
to 1, delays at one airport are likely carried over to the next.
Use a scatter plot to show the ratio over time for the different airports.
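The join-and-ratio idea can be sketched on a handful of made-up flights. The names `inbound`, `outbound`, and `airport` are hypothetical, and the direction of the ratio (inbound median over outbound median) is an assumption, since the notebook line that computes it is cut off.

```python
import pandas as pd

# Toy data (made-up rows); column names follow the flight table.
toy = pd.DataFrame({
    "DayofMonth": [1, 1, 1, 1, 2, 2],
    "Origin":     ["LAX", "LAX", "IAD", "IAD", "LAX", "IAD"],
    "Dest":       ["IAD", "IAD", "BOS", "BOS", "IAD", "BOS"],
    "DepDelay":   [10, 30, 15, 25, 40, 10],
})

# Median delay of flights heading into each airport, per day.
inbound = (
    toy.groupby(["DayofMonth", "Dest"])["DepDelay"].median()
    .rename("inbound_median").reset_index()
    .rename(columns={"Dest": "airport"})
)
# Median delay of flights leaving each airport, per day.
outbound = (
    toy.groupby(["DayofMonth", "Origin"])["DepDelay"].median()
    .rename("outbound_median").reset_index()
    .rename(columns={"Origin": "airport"})
)

# Keep only airport-days that have both inbound and outbound flights.
joined = inbound.merge(outbound, on=["DayofMonth", "airport"], how="inner")
joined["ratio"] = joined["inbound_median"] / joined["outbound_median"]
print(joined)
```

In the toy data only IAD has both inbound and outbound flights. Its ratio is 1.0 on day 1 and 4.0 on day 2, which is the kind of unstable ratio that would not support a cascading-delay reading.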
In [127]:
flight_delay = flight[flight['DepDelay'] > 0]

flight_sub_dest_grouped = flight_delay.groupby(['Year', 'Month', 'DayofMonth', 'Dest'], as_index
flight_sub_orig_grouped = flight_delay.groupby(['Year', 'Month', 'DayofMonth', 'Origin'], as_ind
flight_sub_joined = pd.merge(flight_sub_dest_grouped, flight_sub_orig_grouped, how='inner',
                             left_on=['Year', 'Month', 'DayofMonth', 'Dest'],
                             right_on=['Year', 'Month', 'DayofMonth', 'Origin'])
/var/folders/f1/nhl1t6rs2mq050hfbt4c7dxw0000gn/T/ipykernel_11438/3901546009.py:4: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
flight_sub_joined = pd.merge(flight_sub_dest_grouped, flight_sub_orig_grouped, how='inner', left_on=['Year', 'Month', 'DayofMonth', 'Dest'], right_on=['Year', 'Month', 'DayofMonth', 'Origin'])
In [131]:
flight_sub_joined = flight_sub_joined[flight_sub_joined['Year'] == 2006]

flight_sub_joined = flight_sub_joined[flight_sub_joined['Month'] == 1]
# flight_sub_joined = flight_sub_joined[flight_sub_joined['DayofMonth'] <= 10]

flight_sub_joined['ymd'] = flight_sub_joined['Year'].astype(str) + flight_sub_joined['Month'].as
flight_sub_joined['ratio'] = flight_sub_joined['DepDelay_x']['median'] / flight_sub_joined['DepD

In [132]:
plt.clf()
ax = sns.scatterplot(x='ymd', y='ratio', hue='Dest', data=flight_sub_joined, legend=None)
plt.xticks(rotation=90)
plt.show()

In [133]:

plt.clf()
ax = sns.scatterplot(x='ymd', y='ratio', hue='Dest', data=flight_sub_joined[flight_sub_joined['r
plt.xticks(rotation=90)
plt.show()
As shown in the figures, the ratio between the origin and destination delays shows no clear relationship,
and the ratio is not close to 1.
So this analysis finds no evidence of cascading failures in which delays in one airport create delays in others.
Question5
Use the available variables to construct a model that predicts delays.
Divide the dataset into two groups: delay and no delay.

Join the flight data with the airport data to get both origin and destination airport info.
Choose time, distance, and location as attributes:
"Year", "Month", "DayOfWeek", "CRSDepTime", "CRSArrTime", "CRSElapsedTime", "Distance",
"lat_x", "long_x", "lat_y", "long_y"
Downsample the data to 20000 items, because the original dataset is too large.
Divide the dataset into train and test sets with a 7 : 3 ratio.
Train a random forest model.
Get the test result.
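The labelling and the 7 : 3 split can be sketched on made-up rows. As in the notebook, a flight counts as delayed when its actual elapsed time exceeds the scheduled elapsed time. The sketch uses `DataFrame.sample` in place of scikit-learn's `train_test_split` to stay dependency-light; the values and `random_state=0` are assumptions for illustration.

```python
import pandas as pd

# Toy data (made-up values); column names follow the flight table.
toy = pd.DataFrame({
    "CRSElapsedTime":    [60, 90, 120, 150, 180, 210, 240, 270, 300, 330],
    "ActualElapsedTime": [65, 85, 120, 160, 175, 230, 240, 265, 310, 329],
})

# Label: 1 if the flight took longer than scheduled, else 0.
overrun = toy["ActualElapsedTime"] - toy["CRSElapsedTime"]
toy["delay"] = (overrun > 0).astype(int)

# 7 : 3 split: draw 70% of the rows for training, keep the rest for testing.
train = toy.sample(frac=0.7, random_state=0)
test = toy.drop(train.index)

print(len(train), len(test), toy["delay"].mean())
```

Checking the class balance, here `toy["delay"].mean()`, before training helps when reading the accuracy later, because a classifier that always predicts the majority class already reaches that share.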

In [157]:
airport = pd.read_csv('./dataverse_files/airports.csv')
sampled_flight = flight.sample(20000)
flight_ori_airport = pd.merge(sampled_flight, airport, left_on="Origin", right_on="iata")
flight_src_airport = pd.merge(flight_ori_airport, airport, left_on="Dest", right_on="iata")
flight_src_airport['delay'] = flight_src_airport['ActualElapsedTime'] - flight_src_airport['CRSElapsedTime']
In [158]:
# Take an explicit copy so later assignments do not trigger SettingWithCopyWarning
flightds = flight_src_airport[["Year", "Month", "DayOfWeek", "CRSDepTime", "CRSArrTime", "CRSElapsedTime", "Distance", "lat_x", "long_x", "lat_y", "long_y"]].copy()
In [159]:
# Label: 1 if the actual elapsed time exceeded the scheduled elapsed time, else 0
flightds['delay'] = np.where(flight_src_airport['delay'] > 0, 1, 0)

flightds = flightds.dropna()
In [167]:
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix, roc_curve, auc

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
In [162]:
X_train, X_test, Y_train, Y_test = train_test_split(flightds[["Year", "Month", "DayOfWeek", "CRS

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [163]:
clf = RandomForestClassifier()
model = clf.fit(X_train, Y_train)
In [164]:
Y_pred = model.predict(X_test)
cm = confusion_matrix(Y_test, Y_pred, labels = model.classes_)
cd = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels=model.classes_)
cd.plot()
print(f'Acc: {accuracy_score(Y_test, Y_pred)}')
Acc: 0.6371666666666667

In [169]:
fpr, tpr, thre = roc_curve(Y_test, Y_pred)

roc_auc = auc(fpr, tpr)
plt.clf()
plt.plot(fpr, tpr, 'b', label = f'AUC = {roc_auc:0.2}')
plt.ylabel('TP rate')
plt.xlabel('FP rate')
Out[169]:
Text(0.5, 0, 'FP rate')
The model has an accuracy of about 63.7%.

The confusion matrix shows many false negatives (FNs): delayed flights that the model predicts as not delayed.
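How false negatives relate to accuracy can be sketched with plain NumPy. The labels below are made up for illustration and are not the model's predictions; `recall` is added here as an extra metric that the notebook does not report.

```python
import numpy as np

# Made-up true labels and predictions (1 = delayed, 0 = not delayed).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0])

# The four cells of a binary confusion matrix.
tp = int(((y_true == 1) & (y_pred == 1)).sum())  # delayed, predicted delayed
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # delayed, predicted on time
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # on time, predicted delayed
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # on time, predicted on time

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of delayed flights that the model catches

print(f"TP={tp} FN={fn} FP={fp} TN={tn} accuracy={accuracy:.2f} recall={recall:.2f}")
```

In this toy case the accuracy is 0.60 even though only one in four delayed flights is caught, which shows why a high FN count matters more than the headline accuracy when the goal is to predict delays.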
