Introduction
This project uses the 2009 ASA Statistical Computing and Graphics dataset.
Because the full dataset is very large (it covers the years 1987 to 2008),
I use only the 2006 and 2007 data as a sub-dataset for the tasks.
The project focuses on five questions:
1. When is the best time of day, day of the week, and time of year to fly to minimise delays?
2. Do older planes suffer more delays?
3. How does the number of people flying between different locations change over time?
4. Can you detect cascading failures as delays in one airport create delays in others?
5. Use the available variables to construct a model that predicts delays.
Library introduction
pandas: load and process CSV data
numpy: array and matrix operations
matplotlib: data visualization
seaborn: statistical plots (box plots and scatter plots)
sklearn: train the prediction model
Data wrangling
In [80]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
In [11]:
flight_2007 = pd.read_csv('./dataverse_files/2007.csv')
flight_2006 = pd.read_csv('./dataverse_files/2006.csv')
flight = pd.concat([flight_2006, flight_2007])
In [12]:
airport = pd.read_csv('./dataverse_files/airports.csv')
In [13]:
plane = pd.read_csv('./dataverse_files/plane-data.csv')
flight data
In [14]:
print(flight.shape)
print(flight.columns)
(14595137, 29)
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',
'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',
'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
dtype='object')
plane data
In [15]:
print(plane.shape)
print(plane.columns)
(5029, 9)
Index(['tailnum', 'type', 'manufacturer', 'issue_date', 'model', 'status',
'aircraft_type', 'engine_type', 'year'],
dtype='object')
airport data
In [16]:
print(airport.shape)
print(airport.columns)
(3376, 7)
Index(['iata', 'airport', 'city', 'state', 'country', 'lat', 'long'], dtype='object')
Question 1
When is the best time of day, day of the week, and time of year to fly to minimise delays?
In [141]:
In [135]:
In [136]:
dom_delay_np = dom_delay.to_numpy()
dow_delay_np = dow_delay.to_numpy()
hod_delay_np = hod_delay.to_numpy()
In [26]:
# Each array holds the group key in column 0, the mean delay in column 2
# and the median delay in column 3. Median and mean get separate y-axes.
# Panel 1: day of month
ax1 = plt.subplot(3, 1, 1)
l1, = ax1.plot(dom_delay_np[:,0], dom_delay_np[:,3], label='median')
ax11 = ax1.twinx()
l11, = ax11.plot(dom_delay_np[:,0], dom_delay_np[:,2], color='orange', label='mean')
plt.legend([l1, l11], ['median', 'mean'])
# Panel 2: day of week
ax2 = plt.subplot(3, 1, 2)
l2, = ax2.plot(dow_delay_np[:,0], dow_delay_np[:,3], label='median')
ax21 = ax2.twinx()
l21, = ax21.plot(dow_delay_np[:,0], dow_delay_np[:,2], color='orange', label='mean')
plt.legend([l2, l21], ['median', 'mean'])
# Panel 3: hour of day
ax3 = plt.subplot(3, 1, 3)
l3, = ax3.plot(hod_delay_np[:,0], hod_delay_np[:,3], label='median')
ax31 = ax3.twinx()
l31, = ax31.plot(hod_delay_np[:,0], hod_delay_np[:,2], color='orange', label='mean')
plt.legend([l3, l31], ['median', 'mean'])
Out[26]:
<matplotlib.legend.Legend at 0x17fa320d0>
In [153]:
plt.clf()
ax = sns.boxplot(x='DayofMonth', y='DelayMin', data=flight, showfliers=False)
plt.show()
In [152]:
plt.clf()
ax = sns.boxplot(x='CRSDepHourOfDay', y='DelayMin', data=flight, showfliers=False)
plt.show()
In [156]:
plt.clf()
ax = sns.boxplot(x='DayOfWeek', y='DelayMin', data=flight, showfliers=False)
plt.show()
As shown in the figures, we mainly focus on the median departure delay (dep_delay) grouped by time.
Best time of day: 5 o'clock, because it has the smallest maximum delay and a lower average delay.
Best day of the week: day 2 (Tuesday, since DayOfWeek runs from 1 = Monday to 7 = Sunday), because it has a lower median and average delay.
Best days of the month: 6, 8 and 9. All of them have the smallest maximum delay and lower average and
median delays, but the box plot cannot show the difference between them.
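The cells that build dom_delay, dow_delay and hod_delay are not shown above. A minimal sketch of how such grouped statistics could be computed follows, using a small synthetic frame. It assumes that DelayMin is the departure delay in minutes and that CRSDepHourOfDay is the hour part of the hhmm-encoded CRSDepTime. Both definitions, and the helper name delay_stats, are assumptions for illustration and not the notebook's actual code.

```python
import pandas as pd

# Tiny synthetic stand-in for the flight table (values are made up).
demo = pd.DataFrame({
    'DayOfWeek':  [1, 1, 2, 2, 2, 3],
    'CRSDepTime': [530, 1745, 545, 1210, 1755, 600],  # hhmm as an integer
    'DepDelay':   [0.0, 30.0, 2.0, 10.0, 45.0, 5.0],
})

# Assumed definitions: delay in minutes and scheduled departure hour.
demo['DelayMin'] = demo['DepDelay']
demo['CRSDepHourOfDay'] = demo['CRSDepTime'] // 100


def delay_stats(df, key):
    """Return one row per value of `key` with the sum, mean and median delay."""
    return (df.groupby(key)['DelayMin']
              .agg(['sum', 'mean', 'median'])
              .reset_index())


dow_stats = delay_stats(demo, 'DayOfWeek')
hod_stats = delay_stats(demo, 'CRSDepHourOfDay')
print(dow_stats)
print(hod_stats)
```

With this column order (key, sum, mean, median), calling to_numpy() on the result puts the mean in column 2 and the median in column 3, which matches the indices used in the plotting cell.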
Question 2
Do older planes suffer more delays?
In [27]:
plane = pd.read_csv('./dataverse_files/plane-data.csv')
plane = plane.dropna()
flight_plane = flight.join(plane.set_index('tailnum'), on='TailNum')
In [28]:
In [171]:
ax1 = plt.subplot(1, 1, 1)
l1 = ax1.scatter(plane_delay_np[:,0], plane_delay_np[:,1], label='sum')
ax1.tick_params(axis='x', rotation=90)
ax11 = ax1.twinx()
l11 = ax11.scatter(plane_delay_np[:,0], plane_delay_np[:,2], color='orange', label='mean')
plt.legend([l1, l11], ['sum', 'mean'])
Out[171]:
<matplotlib.legend.Legend at 0x109dfbf10>
In [194]:
As shown in the figure, we mainly focus on the mean departure delay (dep_delay) grouped by the plane's issue year.
There is no clear relationship between manufacturing year and mean departure delay, so the data give no
evidence that older planes suffer more delays.
The Pearson correlation coefficient is 0.155, which indicates at most a weak linear correlation between
manufacturing year and delay.
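The cell that computes the coefficient is not shown. As a hedged illustration, the sketch below computes a Pearson coefficient with np.corrcoef on made-up numbers and then squares it, since r squared is the share of variance a straight-line fit explains. For the reported value, 0.155 squared is about 0.024, so a linear fit would explain roughly 2.4% of the variance. The arrays years and mean_delay are hypothetical.

```python
import numpy as np

# Made-up per-year mean delays, purely for illustration.
years = np.array([1990, 1995, 2000, 2005], dtype=float)
mean_delay = np.array([10.0, 12.0, 9.0, 13.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(years, mean_delay)[0, 1]

# r squared is the share of variance explained by a straight-line fit.
r_squared = r ** 2
print(f'r = {r:.3f}, r^2 = {r_squared:.3f}')

# The same calculation for the coefficient reported in the text.
reported_r_squared = 0.155 ** 2
print(f'0.155 ** 2 = {reported_r_squared:.3f}')
```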
Question 3
How does the number of people flying between different locations change over time?
In [30]:
In [31]:
ODYear = flight[flight['SortedOriginDest'].isin(max_n_OD)]
ODYear = ODYear[ODYear['Year'].isin([2006, 2007])]
ODYear = ODYear.groupby(['SortedOriginDest', 'Month'], as_index=False)
ODYear_cnt = ODYear['SortedOriginDest'].size()
In [32]:
plot_dic = {}
for od in max_n_OD:
    plot_dic[od] = ODYear_cnt[ODYear_cnt['SortedOriginDest'] == od]
    plt.plot(plot_dic[od]['Month'], plot_dic[od]['size'], label=od)
plt.legend()
Out[32]:
<matplotlib.legend.Legend at 0x17fc5d4f0>
In this approach, I chose the top 5 routes, grouping both directions between an origin and a destination
city together.
For example, flights in the directions (HNL, OGG) and (OGG, HNL) are counted as the same 'locations'.
The number of flights is used as a proxy for the number of people flying, because the flight data contain no passenger counts.
As shown in the figure, for routes with HNL as origin or destination, February has the fewest flights and
July has the most.
All routes show a decrease in February and an increase in March.
More people fly in summer than in winter.
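The cell that creates SortedOriginDest and max_n_OD is not shown. One way to build a direction-independent route key and pick the busiest routes is sketched below on synthetic data. The '-' separator and the use of value_counts are assumptions, and the example keeps the top 2 routes instead of 5 only because the toy table is small.

```python
import pandas as pd

# Synthetic flights (made up) with both directions of some routes.
demo = pd.DataFrame({
    'Origin': ['HNL', 'OGG', 'HNL', 'LAX', 'SFO', 'LAX'],
    'Dest':   ['OGG', 'HNL', 'OGG', 'SFO', 'LAX', 'JFK'],
    'Month':  [1, 1, 2, 1, 2, 2],
})

# Direction-independent key: sort the two airport codes alphabetically.
demo['SortedOriginDest'] = [
    '-'.join(sorted(pair)) for pair in zip(demo['Origin'], demo['Dest'])
]

# Keep the n busiest routes (n = 2 here; the analysis above uses 5).
max_n_OD = demo['SortedOriginDest'].value_counts().head(2).index.tolist()

# Number of flights per route and month.
monthly = (demo[demo['SortedOriginDest'].isin(max_n_OD)]
           .groupby(['SortedOriginDest', 'Month'])
           .size()
           .reset_index(name='size'))
print(max_n_OD)
print(monthly)
```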
In [95]:
In [96]:
for _, row in lats_g.iterrows():
    lats[(int)(row['round_lat'] // slot - flight_airport['round_lat'].min() // slot), dates.ind
for _, row in longs_g.iterrows():
    longs[(int)(row['round_long'] // slot - flight_airport['round_long'].min() // slot), dates.
In [98]:
(12, 24)
In [99]:
In this approach, flights are grouped into latitude and longitude slots by their origin airport and counted over time.
As shown in the figures, the number of passengers by flight origin does not change much.
The heatmap's color shows how the number of passengers changes with respect to latitude and longitude.
The color of each slot does not change much over time.
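Several cells of this step are missing or truncated, so the following is only a sketch of the general idea, not a reconstruction of the notebook's code. It bins synthetic origin latitudes into fixed-width slots and counts flights per slot and year-month, which yields a matrix suitable for a heatmap. The slot width of 5 degrees and the column names lat_slot and ym are assumptions.

```python
import pandas as pd

# Synthetic flights already joined with the origin airport's latitude.
demo = pd.DataFrame({
    'Year':  [2006, 2006, 2006, 2007, 2007],
    'Month': [1, 1, 2, 1, 1],
    'lat':   [21.3, 33.9, 34.1, 40.6, 21.3],
})

slot = 5  # assumed slot width in degrees
demo['lat_slot'] = (demo['lat'] // slot).astype(int)
demo['ym'] = demo['Year'] * 100 + demo['Month']

# Rows: latitude slots, columns: year-month periods, values: flight counts.
counts = (demo.groupby(['lat_slot', 'ym'])
              .size()
              .unstack(fill_value=0)
              .sort_index())
heat = counts.to_numpy()
print(counts)
print(heat.shape)
```

Only slots that contain at least one flight appear as rows here. The counts frame could be passed to sns.heatmap, and the same steps apply to longitude.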
Question 4
Can you detect cascading failures as delays in one airport create delays in others?
In [127]:
/var/folders/f1/nhl1t6rs2mq050hfbt4c7dxw0000gn/T/ipykernel_11438/3901546009.py:4: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
flight_sub_joined = pd.merge(flight_sub_dest_grouped, flight_sub_orig_grouped, how='inner', left_on=['Year', 'Month', 'DayofMonth', 'Dest'], right_on=['Year', 'Month', 'DayofMonth', 'Origin'])
In [131]:
In [132]:
plt.clf()
ax = sns.scatterplot(x='ymd', y='ratio', hue='Dest', data=flight_sub_joined, legend=None)
plt.xticks(rotation=90)
plt.show()
In [133]:
plt.clf()
ax = sns.scatterplot(x='ymd', y='ratio', hue='Dest', data=flight_sub_joined[flight_sub_joined['r
plt.xticks(rotation=90)
plt.show()
As shown in the figures, the ratio between the origin and destination delays shows no clear relationship,
and the ratio is not close to 1.
So this analysis finds no evidence of cascading failures in which delays at one airport create delays at others.
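The cells that build the grouped frames and the ratio column are not shown. The sketch below illustrates one plausible version of the idea on synthetic data: for each airport and day, compare the mean arrival delay of inbound flights with the mean departure delay of outbound flights. Defining the ratio as outbound delay divided by inbound delay, and the column names Airport, in_delay and out_delay, are assumptions.

```python
import pandas as pd

# Synthetic flights between three airports over two days (values made up).
demo = pd.DataFrame({
    'Year': [2006] * 6,
    'Month': [1] * 6,
    'DayofMonth': [1, 1, 1, 1, 2, 2],
    'Origin': ['ATL', 'ORD', 'ORD', 'ATL', 'ATL', 'ORD'],
    'Dest':   ['ORD', 'ATL', 'DFW', 'ORD', 'ORD', 'ATL'],
    'ArrDelay': [40.0, 5.0, 0.0, 20.0, 10.0, 2.0],
    'DepDelay': [35.0, 30.0, 15.0, 10.0, 5.0, 5.0],
})
day = ['Year', 'Month', 'DayofMonth']

# Mean arrival delay of flights landing at each airport, per day.
arr = (demo.groupby(day + ['Dest'], as_index=False)['ArrDelay'].mean()
           .rename(columns={'Dest': 'Airport', 'ArrDelay': 'in_delay'}))
# Mean departure delay of flights leaving each airport, per day.
dep = (demo.groupby(day + ['Origin'], as_index=False)['DepDelay'].mean()
           .rename(columns={'Origin': 'Airport', 'DepDelay': 'out_delay'}))

# Inner join keeps only airport-days that have both arrivals and departures.
joined = pd.merge(arr, dep, how='inner', on=day + ['Airport'])
# Assumed definition of the ratio: outbound delay relative to inbound delay.
joined['ratio'] = joined['out_delay'] / joined['in_delay']
print(joined)
```

Airport-days with a mean inbound delay of zero would produce infinite ratios and would need to be filtered out before plotting.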
Question 5
Use the available variables to construct a model that predicts delays.
In [157]:
airport = pd.read_csv('./dataverse_files/airports.csv')
sampled_flight = flight.sample(20000)
flight_ori_airport = pd.merge(sampled_flight, airport, left_on="Origin", right_on="iata")
flight_src_airport = pd.merge(flight_ori_airport, airport, left_on="Dest", right_on="iata")
flight_src_airport['delay'] = flight_src_airport['ActualElapsedTime'] - flight_src_airport['CRSE
In [158]:
In [159]:
/var/folders/f1/nhl1t6rs2mq050hfbt4c7dxw0000gn/T/ipykernel_11438/1716332030.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
In [167]:
In [162]:
In [163]:
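from sklearn.ensemble import RandomForestClassifier

# Fit a random forest with default hyperparameters on the training split.
clf = RandomForestClassifier()
model = clf.fit(X_train, Y_train)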
In [164]:
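from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score

# Predict on the held-out split, then show the confusion matrix and accuracy.
Y_pred = model.predict(X_test)
cm = confusion_matrix(Y_test, Y_pred, labels=model.classes_)
cd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
cd.plot()
print(f'Acc: {accuracy_score(Y_test, Y_pred)}')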
Acc: 0.6371666666666667
In [169]:
Out[169]:
Text(0.5, 0, 'FP rate')
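The cells that select features, define the label, split the data and draw the ROC curve are not shown. The following self-contained sketch shows what such a pipeline might look like on synthetic data. The two features (departure hour and distance), the binary delayed / not delayed label, the 70/30 split and n_estimators=50 are all assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
# Synthetic features standing in for departure hour and flight distance.
hour = rng.integers(0, 24, size=n)
distance = rng.uniform(100, 2500, size=n)
X = np.column_stack([hour, distance])
# Synthetic binary label: later departures are more often "delayed".
y = (hour + rng.normal(0, 4, size=n) > 14).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
model = clf.fit(X_train, Y_train)

Y_pred = model.predict(X_test)
# Probability of the positive class, needed for the ROC curve.
Y_score = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(Y_test, Y_score)

acc = accuracy_score(Y_test, Y_pred)
auc = roc_auc_score(Y_test, Y_score)
print(f'Acc: {acc:.3f}')
print(f'AUC: {auc:.3f}')
```

Plotting fpr against tpr with plt.plot gives the ROC curve. Comparing the accuracy with the share of the majority class helps judge whether a score such as the 0.637 reported above is better than always predicting the most common label.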