Lab 4

lab4 - Jupyter Notebook 1/24/23, 5:12 PM
In [1]: import pandas as pd
In [2]: df = pd.read_csv('flights.csv')
df.head()
Out[2]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
0 2013 1 1 517.0 515 2.0 830.0 819 11.0
1 2013 1 1 533.0 529 4.0 850.0 830 20.0
2 2013 1 1 542.0 540 2.0 923.0 850 33.0
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0
1 Sort (ascending and descending) a variable having missing values. After sorting, check where the missing values are placed (top or bottom?)
In [5]: df.isna().sum()
Out[5]: year 0
month 0
day 0
dep_time 8255
sched_dep_time 0
dep_delay 8255
arr_time 8713
sched_arr_time 0
arr_delay 9430
carrier 0
flight 0
tailnum 2512
origin 0
dest 0
air_time 9430
distance 0
hour 0
minute 0
time_hour 0
dtype: int64
In [7]: # sorting based on dep_time column

sorted_df = df.sort_values(by=['dep_time'], ascending=True)
sorted_df
Out[7]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
2013 7 1 1.0 2029 212.0 236.0 2359 157.0
2013 12 30 1.0 2359 2.0 441.0 437 4.0
2013 6 20 1.0 2359 2.0 340.0 350 -10.0
2013 5 22 1.0 1935 266.0 154.0 2140 254.0
2013 5 25 1.0 2359 2.0 336.0 341 -5.0
... ... ... ... ... ... ... ... ...
2013 9 30 NaN 1455 NaN NaN 1634 NaN
2013 9 30 NaN 2200 NaN NaN 2312 NaN
2013 9 30 NaN 1210 NaN NaN 1330 NaN
2013 9 30 NaN 1159 NaN NaN 1344 NaN
2013 9 30 NaN 840 NaN NaN 1020 NaN
336776 rows × 19 columns
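In an ascending sort, NaN values are placed at the bottom. Because sort_values uses na_position='last' by default, they also stay at the bottom in a descending sort.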
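The placement of missing values can be checked on a tiny frame. This is a minimal sketch that uses made-up values in place of flights.csv; the name `toy` is hypothetical. `na_position` is the `sort_values` parameter that controls where NaN goes.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the dep_time column (values are made up)
toy = pd.DataFrame({'dep_time': [542.0, np.nan, 517.0]})

# Default behaviour: NaN goes last, whatever the sort direction
asc = toy.sort_values(by='dep_time', ascending=True)
desc = toy.sort_values(by='dep_time', ascending=False)

# NaN is moved to the top only when asked for explicitly
first = toy.sort_values(by='dep_time', na_position='first')

print(asc['dep_time'].tolist())
print(desc['dep_time'].tolist())
print(first['dep_time'].tolist())
```

On the real data, the same check would be df.sort_values(by=['dep_time'], ascending=False).tail().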
2 Sort flights to find the most delayed flights. Find the flights that left earliest.
In [8]: # sorting based on dep_delay column (ascending: flights that left earliest come first)
sorted_df = df.sort_values(by=['dep_delay'], ascending=True)
sorted_df
Out[8]:
day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
7 2040.0 2123 -43.0 40.0 2352 48.0 B6 97 N592JB
3 2022.0 2055 -33.0 2240.0 2338 -58.0 DL 1715 N612DL
10 1408.0 1440 -32.0 1549.0 1559 -10.0 EV 5713 N825AS
11 1900.0 1930 -30.0 2233.0 2243 -10.0 DL 1435 N934DL
29 1703.0 1730 -27.0 1947.0 1957 -10.0 F9 837 N208FR
... ... ... ... ... ... ... ... ...
30 NaN 1455 NaN NaN 1634 NaN 9E 3393
30 NaN 2200 NaN NaN 2312 NaN 9E 3525
30 NaN 1210 NaN NaN 1330 NaN MQ 3461 N535MQ
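The ascending sort above answers the second half of the question (flights that left earliest relative to schedule). For the most delayed flights the order has to be reversed, and rows with a missing dep_delay are best dropped first. A minimal sketch on made-up rows, assuming "left earliest" means the most negative dep_delay; `toy`, `most_delayed` and `left_earliest` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy rows (made up) with the same column names as flights.csv
toy = pd.DataFrame({
    'flight': [97, 1715, 51, 3393],
    'dep_delay': [-43.0, -33.0, 1301.0, np.nan],
})

# Drop flights without a recorded departure delay
known = toy.dropna(subset=['dep_delay'])

# Largest positive dep_delay first = most delayed flights
most_delayed = known.sort_values(by='dep_delay', ascending=False).head(2)

# Most negative dep_delay first = flights that left earliest
left_earliest = known.sort_values(by='dep_delay', ascending=True).head(2)

print(most_delayed)
print(left_earliest)
```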
3 Sort flights to find the fastest (highest speed) flights.
In [10]: df['speed'] = df['distance']/df['air_time']
df = df.sort_values(by=['speed'], ascending=False)
df
Out[10]:
sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
1700 9.0 1923.0 1937 -14.0 DL 1499 N666DN LGA
1513 45.0 1745.0 1719 26.0 EV 4667 N17196 EWR
2025 15.0 2225.0 2226 -1.0 EV 4292 N14568 EWR
1910 4.0 2045.0 2043 2.0 EV 3805 N12567 EWR
1600 -1.0 1849.0 1917 -28.0 DL 1902 N956DL LGA
... ... ... ... ... ... ... ... ...
1455 NaN NaN 1634 NaN 9E 3393 NaN JFK
2200 NaN NaN 2312 NaN 9E 3525 NaN LGA
1210 NaN NaN 1330 NaN MQ 3461 N535MQ LGA
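The ratio distance/air_time has no obvious unit. Assuming distance is recorded in miles and air_time in minutes (an assumption about flights.csv, not something the notebook states), multiplying by 60 gives miles per hour. A minimal sketch on made-up rows; the tail numbers and the column name `speed_mph` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'tailnum': ['A', 'B', 'C'],          # hypothetical tail numbers
    'distance': [762, 1008, 200],        # assumed to be miles
    'air_time': [65.0, 120.0, np.nan],   # assumed to be minutes
})

# miles / minutes * 60 = miles per hour
toy['speed_mph'] = toy['distance'] / toy['air_time'] * 60

# Flights without an air_time have no speed, so leave them out
fastest = (toy.dropna(subset=['speed_mph'])
              .sort_values(by='speed_mph', ascending=False))

print(fastest)
```

Writing the result to a new variable instead of reassigning df keeps the original row order available for later cells.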
4 Which flights travelled the farthest? Which travelled the shortest?
In [36]: df = df.sort_values(by=['distance'], ascending=False)
df.head()
Out[36]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_del
116675 2013 2 7 900.0 900 0.0 1528.0 1540
261903 2013 7 13 958.0 1000 -2.0 1440.0 1430
62652 2013 11 8 948.0 1000 -12.0 1550.0 1555
203083 2013 5 11 954.0 1000 -6.0 1456.0 1500
69175 2013 11 15 951.0 1000 -9.0 1545.0 1555
In [37]: df.tailnum.iloc[0]
Out[37]: 'N383HA'
In [34]: df = df.sort_values(by=['distance'], ascending=True)

df.head()
Out[34]:
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
NaN 106 NaN NaN 245 NaN US 1632 NaN EWR
1959.0 2000 -1.0 2052.0 2054 -2.0 EV 4457 N19554 EWR
2127.0 2130 -3.0 2304.0 2225 39.0 EV 4619 N11194 EWR
2153.0 2129 24.0 2247.0 2224 23.0 EV 4619 N13913 EWR
2123.0 2130 -7.0 2211.0 2225 -14.0 EV 4619 N12921 EWR
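The first row of the ascending sort has no dep_time or arr_time, so the shortest scheduled route belongs to a flight that did not actually fly. One way to handle this, sketched on made-up rows, is to filter on arr_time first and then pick the extremes with idxmax and idxmin. The frame `toy` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy rows (values are illustrative, not taken from flights.csv)
toy = pd.DataFrame({
    'origin': ['JFK', 'EWR', 'EWR', 'LGA'],
    'dest': ['HNL', 'LGA', 'PHL', 'ORD'],
    'distance': [4983, 17, 80, 733],
    'arr_time': [1528.0, np.nan, 2052.0, 830.0],
})

# Keep only flights that have an arrival time
flown = toy[toy['arr_time'].notna()]

# idxmax / idxmin return the index label of the extreme row
farthest = flown.loc[flown['distance'].idxmax()]
shortest = flown.loc[flown['distance'].idxmin()]

print(farthest['origin'], farthest['dest'], farthest['distance'])
print(shortest['origin'], shortest['dest'], shortest['distance'])
```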
5 Count the number of flights (for each dest) that arrived at the destination, i.e. that were not cancelled.
In [18]: df.dest.value_counts()
Out[18]: ORD 17283

ATL 17215
LAX 16174
BOS 15508
MCO 14082
...
HDN 15
SBN 10
ANC 8
LGA 1
LEX 1
Name: dest, Length: 105, dtype: int64
In [23]: cancelled = df.arr_time.isna().sum()

cancelled
Out[23]: 8713
In [27]: not_cancelled = len(df) - cancelled

not_cancelled
Out[27]: 328063
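The cells above give the flights per destination (cancelled ones included) and the overall number of non-cancelled flights, but not the per-destination count the question asks for. A minimal sketch of that count, keeping the notebook's definition of "cancelled" as a missing arr_time; the rows are made up and `arrived_per_dest` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy rows (made up) with the same column names as flights.csv
toy = pd.DataFrame({
    'dest': ['ORD', 'ORD', 'ATL', 'ATL', 'LEX'],
    'arr_time': [830.0, np.nan, 850.0, 923.0, np.nan],
})

# Keep only flights that arrived (arr_time present)
arrived = toy[toy['arr_time'].notna()]

# Count the remaining flights for each destination
arrived_per_dest = arrived['dest'].value_counts()

print(arrived_per_dest)
```

A destination whose flights were all cancelled (LEX in the toy data) does not appear in the result at all.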
6 Calculate the number of cancelled flights per day. Is there a pattern? Is the average number of cancelled flights (per day) related to the average delay (per day)?
In [38]: df['date'] = pd.to_datetime(df[['year', 'month','day']])
df.date.head()
Out[38]: 116675 2013-02-07

261903 2013-07-13
62652 2013-11-08
203083 2013-05-11
69175 2013-11-15
Name: date, dtype: datetime64[ns]
In [39]: df['cancelled'] = df['arr_time'].isna()

df.cancelled.head()
Out[39]: 116675 False

261903 False
62652 False
203083 False
69175 False
Name: cancelled, dtype: bool
In [ ]: df.arr_delay = df.arr_delay  # no-op: assigns the column to itself (cell was never run)
In [40]: df1 = df[df['cancelled']]  # keep only the cancelled flights
In [41]: df1.groupby('date')['cancelled'].count()
Out[41]: date
2013-01-01 5
2013-01-02 10
2013-01-03 10
2013-01-04 6
2013-01-05 3
..
2013-12-27 2
2013-12-28 1
2013-12-29 19
2013-12-30 14
2013-12-31 16
Name: cancelled, Length: 359, dtype: int64
In [49]: df_new = df1.groupby('date')['cancelled'].count().reset_index(name='num of flights')

df_new
Out[49]:
date num of flights
0 2013-01-01 5
1 2013-01-02 10
2 2013-01-03 10
3 2013-01-04 6
4 2013-01-05 3
... ... ...
354 2013-12-27 2
355 2013-12-28 1
356 2013-12-29 19
357 2013-12-30 14
358 2013-12-31 16
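The result above has 359 rows, while the per-day delay tables further down run to index 364: days without any cancellation disappear when the frame is filtered before grouping. Summing the boolean column on the unfiltered frame keeps those days with a count of 0. A minimal sketch on made-up rows; the column name 'num cancelled' is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): 2013-01-02 has no cancelled flight
toy = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-01', '2013-01-01',
                            '2013-01-02', '2013-01-03']),
    'arr_time': [np.nan, 830.0, 850.0, np.nan],
})

# Same definition as the notebook: cancelled = no arrival time
toy['cancelled'] = toy['arr_time'].isna()

# Summing booleans counts the True values; zero-days are kept
per_day = (toy.groupby('date')['cancelled']
              .sum()
              .reset_index(name='num cancelled'))

print(per_day)
```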
In [53]: df_a = df.groupby(['date'])['arr_delay'].mean().reset_index(name='avg arr delay')

df_a
Out[53]:
date avg arr delay
0 2013-01-01 12.651023
1 2013-01-02 12.692888
2 2013-01-03 5.733333
3 2013-01-04 -1.932819
4 2013-01-05 -1.525802
... ... ...
360 2013-12-27 -0.148803
361 2013-12-28 -3.259533
362 2013-12-29 18.763825
363 2013-12-30 10.057712
364 2013-12-31 6.212121
In [54]: df_d = df.groupby(['date'])['dep_delay'].mean().reset_index(name='avg dep delay')

df_d
Out[54]:
date avg dep delay
0 2013-01-01 11.548926
1 2013-01-02 13.858824
2 2013-01-03 10.987832
3 2013-01-04 8.951595
4 2013-01-05 5.732218
... ... ...
360 2013-12-27 10.937630
361 2013-12-28 7.981550
362 2013-12-29 22.309551
363 2013-12-30 10.698113
364 2013-12-31 6.996053
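In [57]: dfs = [df_new, df_a, df_d]

# index every frame by date, then join them side by side
# ('num of flights' is the number of cancelled flights per day)
dfs = [frame.set_index('date') for frame in dfs]
dfs = dfs[0].join(dfs[1:])
dfs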
Out[57]:
num of flights avg arr delay avg dep delay
date
2013-01-01 5.0 12.651023 11.548926
2013-01-02 10.0 12.692888 13.858824
2013-01-03 10.0 5.733333 10.987832
2013-01-04 6.0 -1.932819 8.951595
2013-01-05 3.0 -1.525802 5.732218
... ... ... ...
2013-12-27 2.0 -0.148803 10.937630
2013-12-28 1.0 -3.259533 7.981550
2013-12-29 19.0 18.763825 22.309551
2013-12-30 14.0 10.057712 10.698113
2013-12-31 16.0 6.212121 6.996053
In [45]: import matplotlib.pyplot as plt

import seaborn as sns
In [58]: sns.scatterplot(data=dfs,x='num of flights',y='avg dep delay')
Out[58]: <AxesSubplot:xlabel='num of flights', ylabel='avg dep delay'>
In [59]: sns.scatterplot(data=dfs,x='num of flights',y='avg arr delay')
Out[59]: <AxesSubplot:xlabel='num of flights', ylabel='avg arr delay'>
In [61]: sns.scatterplot(data=dfs,x='date',y='num of flights')
Out[61]: <AxesSubplot:xlabel='date', ylabel='num of flights'>
In [63]: sns.lineplot(data=dfs,x='date',y='num of flights')
Out[63]: <AxesSubplot:xlabel='date', ylabel='num of flights'>
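The scatter plots show the relationship visually; a correlation coefficient puts a number on it. A minimal sketch using only the first five rows printed in Out[57], so the value it prints says nothing about the full year. On the joined frame, the same call would be dfs['num of flights'].corr(dfs['avg dep delay']). The names `daily` and `r` are hypothetical.

```python
import pandas as pd

# First five days as printed in Out[57] (a tiny, non-representative sample)
daily = pd.DataFrame({
    'num of flights': [5, 10, 10, 6, 3],   # cancelled flights per day
    'avg dep delay': [11.548926, 13.858824, 10.987832,
                      8.951595, 5.732218],
})

# Pearson correlation between cancellations and average departure delay
r = daily['num of flights'].corr(daily['avg dep delay'])

print(r)
```

A value close to 1 would suggest that days with many cancellations also tend to have longer delays. A correlation alone does not show which one drives the other.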

