
lab4 - Jupyter Notebook 1/24/23, 5:12 PM

In [1]: import pandas as pd

In [2]: df = pd.read_csv('flights.csv')
df.head()

Out[2]:
   year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  sched_arr_time  arr_delay
0  2013      1    1     517.0             515        2.0     830.0             819       11.0
1  2013      1    1     533.0             529        4.0     850.0             830       20.0
2  2013      1    1     542.0             540        2.0     923.0             850       33.0
3  2013      1    1     544.0             545       -1.0    1004.0            1022      -18.0
4  2013      1    1     554.0             600       -6.0     812.0             837      -25.0

1 Sort (ascending and descending) a variable having missing values. After sorting, check where the missing values are placed (top or bottom?)

http://localhost:8888/notebooks/DMML/lab4.ipynb# Page 1 of 14

In [5]: df.isna().sum()

Out[5]: year 0
month 0
day 0
dep_time 8255
sched_dep_time 0
dep_delay 8255
arr_time 8713
sched_arr_time 0
arr_delay 9430
carrier 0
flight 0
tailnum 2512
origin 0
dest 0
air_time 9430
distance 0
hour 0
minute 0
time_hour 0
dtype: int64


In [7]: # sorting based on dep_time column


sorted_df = df.sort_values(by=['dep_time'], ascending=True)
sorted_df

Out[7]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier

2013 7 1 1.0 2029 212.0 236.0 2359 157.0

2013 12 30 1.0 2359 2.0 441.0 437 4.0

2013 6 20 1.0 2359 2.0 340.0 350 -10.0

2013 5 22 1.0 1935 266.0 154.0 2140 254.0

2013 5 25 1.0 2359 2.0 336.0 341 -5.0

... ... ... ... ... ... ... ... ...

2013 9 30 NaN 1455 NaN NaN 1634 NaN

2013 9 30 NaN 2200 NaN NaN 2312 NaN

2013 9 30 NaN 1210 NaN NaN 1330 NaN

2013 9 30 NaN 1159 NaN NaN 1344 NaN

2013 9 30 NaN 840 NaN NaN 1020 NaN

336776 rows × 19 columns

In an ascending sort, the NaN values are placed at the bottom.
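The task also asks about the descending case, which the cell above does not run. A minimal sketch on a made-up four-row frame (the name `toy` and its values are illustrative, not taken from flights.csv) suggests that `sort_values` keeps NaN at the bottom in both directions by default, and that `na_position='first'` moves them to the top:

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) with one missing dep_time
toy = pd.DataFrame({"dep_time": [517.0, np.nan, 1.0, 2359.0]})

# Default na_position='last': NaN ends up at the bottom either way
asc = toy.sort_values(by="dep_time", ascending=True)
desc = toy.sort_values(by="dep_time", ascending=False)

# na_position='first' puts the NaN rows at the top instead
nan_first = toy.sort_values(by="dep_time", ascending=True, na_position="first")

print(asc["dep_time"].tolist())
print(desc["dep_time"].tolist())
print(nan_first["dep_time"].tolist())
```

The same two calls on `df` with `by=['dep_time']` would show where the 8255 missing departure times land in each direction.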


2 Sort flights to find the most delayed flights. Find the flights that left earliest.
In [8]: # sorting based on dep_delay column (ascending: flights that left earliest relative to schedule come first)
sorted_df = df.sort_values(by=['dep_delay'], ascending=True)
sorted_df

Out[8]:
day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum

7 2040.0 2123 -43.0 40.0 2352 48.0 B6 97 N592JB

3 2022.0 2055 -33.0 2240.0 2338 -58.0 DL 1715 N612DL

10 1408.0 1440 -32.0 1549.0 1559 -10.0 EV 5713 N825AS

11 1900.0 1930 -30.0 2233.0 2243 -10.0 DL 1435 N934DL

29 1703.0 1730 -27.0 1947.0 1957 -10.0 F9 837 N208FR

... ... ... ... ... ... ... ... ...

30 NaN 1455 NaN NaN 1634 NaN 9E 3393

30 NaN 2200 NaN NaN 2312 NaN 9E 3525

30 NaN 1210 NaN NaN 1330 NaN MQ 3461 N535MQ

30 NaN 1159 NaN NaN 1344 NaN MQ 3572 N511MQ

30 NaN 840 NaN NaN 1020 NaN MQ 3531 N839MQ

336776 rows × 19 columns
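The ascending sort above answers the second half of the question (flights that left earliest). For the most delayed flights, the same column can be sorted in descending order. A small sketch, assuming a toy frame whose flight numbers and delays are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only): dep_delay in the same units as the notebook
toy = pd.DataFrame({
    "flight": [101, 102, 103, 104],
    "dep_delay": [-43.0, -33.0, 120.0, np.nan],
})

# Most delayed first; NaN (no recorded departure) stays at the bottom
most_delayed = toy.sort_values(by="dep_delay", ascending=False)

# Left earliest relative to schedule first (most negative delay)
left_earliest = toy.sort_values(by="dep_delay", ascending=True)

print(most_delayed.head(2))
print(left_earliest.head(2))
```

On the real data this would be `df.sort_values(by=['dep_delay'], ascending=False).head()`.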


3 Sort flights to find the fastest (highest speed) flights.
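In [10]: # speed = distance / air_time; rows with a missing air_time get a NaN speed
df['speed'] = df['distance']/df['air_time']
df = df.sort_values(by=['speed'], ascending=False)  # fastest first, NaN rows last
df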

Out[10]:
sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin

1700 9.0 1923.0 1937 -14.0 DL 1499 N666DN LGA

1513 45.0 1745.0 1719 26.0 EV 4667 N17196 EWR

2025 15.0 2225.0 2226 -1.0 EV 4292 N14568 EWR

1910 4.0 2045.0 2043 2.0 EV 3805 N12567 EWR

1600 -1.0 1849.0 1917 -28.0 DL 1902 N956DL LGA

... ... ... ... ... ... ... ... ...

1455 NaN NaN 1634 NaN 9E 3393 NaN JFK

2200 NaN NaN 2312 NaN 9E 3525 NaN LGA

1210 NaN NaN 1330 NaN MQ 3461 N535MQ LGA

1159 NaN NaN 1344 NaN MQ 3572 N511MQ LGA

840 NaN NaN 1020 NaN MQ 3531 N839MQ LGA
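The `speed` column is distance per unit of air time, so its unit depends on the units of the two inputs. As a sketch, assuming `distance` is in miles and `air_time` is in minutes (an assumption, the notebook does not state the units), the speed can be converted to miles per hour, and the rows without an air time can be dropped before ranking. The toy values are made up:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative): units are assumed, not confirmed by the notebook
toy = pd.DataFrame({
    "distance": [762, 1080, 200],        # assumed miles
    "air_time": [65.0, 150.0, np.nan],   # assumed minutes
})

# miles / (minutes / 60) = miles per hour
toy["speed_mph"] = toy["distance"] / (toy["air_time"] / 60)

# Drop flights with no air_time so the ranking only contains real speeds
fastest = toy.dropna(subset=["speed_mph"]).sort_values("speed_mph", ascending=False)
print(fastest)
```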


4 Which flights travelled the farthest? Which travelled the shortest?
In [36]: df = df.sort_values(by=['distance'], ascending=False)
df.head()

Out[36]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_del

116675 2013 2 7 900.0 900 0.0 1528.0 1540

261903 2013 7 13 958.0 1000 -2.0 1440.0 1430

62652 2013 11 8 948.0 1000 -12.0 1550.0 1555

203083 2013 5 11 954.0 1000 -6.0 1456.0 1500

69175 2013 11 15 951.0 1000 -9.0 1545.0 1555

In [37]: df.tailnum.iloc[0]

Out[37]: 'N383HA'


In [34]: df = df.sort_values(by=['distance'], ascending=True)


df.head()

Out[34]:
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin

NaN 106 NaN NaN 245 NaN US 1632 NaN EWR

1959.0 2000 -1.0 2052.0 2054 -2.0 EV 4457 N19554 EWR

2127.0 2130 -3.0 2304.0 2225 39.0 EV 4619 N11194 EWR

2153.0 2129 24.0 2247.0 2224 23.0 EV 4619 N13913 EWR

2123.0 2130 -7.0 2211.0 2225 -14.0 EV 4619 N12921 EWR

5 Count the number of flights (for each dest) that arrived at their destination, i.e. that were not cancelled
In [18]: df.dest.value_counts()

Out[18]: ORD    17283
         ATL    17215
         LAX    16174
         BOS    15508
         MCO    14082
                ...
         HDN       15
         SBN       10
         ANC        8
         LGA        1
         LEX        1
         Name: dest, Length: 105, dtype: int64

In [23]: cancelled = df.arr_time.isna().sum()


cancelled

Out[23]: 8713


In [27]: not_cancelled = len(df) - cancelled


not_cancelled

Out[27]: 328063
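The cells above give the count of all scheduled flights per destination and one overall total of non-cancelled flights. To get the non-cancelled count for each destination, the rows can be filtered first. A minimal sketch, using the notebook's own working definition (a flight with a missing `arr_time` is treated as cancelled) on a made-up toy frame:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative): two of the five flights never arrived
toy = pd.DataFrame({
    "dest": ["ORD", "ORD", "ATL", "ATL", "LEX"],
    "arr_time": [830.0, np.nan, 850.0, 923.0, np.nan],
})

# Keep only flights with a recorded arrival, then count per destination
arrived = toy[toy["arr_time"].notna()]
arrived_per_dest = arrived["dest"].value_counts()
print(arrived_per_dest)
```

On the real data the equivalent would be `df[df['arr_time'].notna()]['dest'].value_counts()`.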

6 Calculate the number of cancelled flights per day. Is there a pattern? Is the average number of cancelled flights (per day) related to the average delay (per day)?
In [38]: df['date'] = pd.to_datetime(df[['year', 'month','day']])
df.date.head()

Out[38]: 116675   2013-02-07
         261903   2013-07-13
         62652    2013-11-08
         203083   2013-05-11
         69175    2013-11-15
         Name: date, dtype: datetime64[ns]

In [39]: df['cancelled'] = df['arr_time'].isna()


df.cancelled.head()

Out[39]: 116675    False
         261903    False
         62652     False
         203083    False
         69175     False
         Name: cancelled, dtype: bool

In [ ]: df.arr_delay = df.arr_delay  # no-op: reassigns the column to itself

In [40]: df1 = df[df['cancelled']]  # keep only the cancelled flights


In [41]: df1.groupby('date')['cancelled'].count()

Out[41]: date
2013-01-01 5
2013-01-02 10
2013-01-03 10
2013-01-04 6
2013-01-05 3
..
2013-12-27 2
2013-12-28 1
2013-12-29 19
2013-12-30 14
2013-12-31 16
Name: cancelled, Length: 359, dtype: int64

In [49]: df_new = df1.groupby('date')['cancelled'].count().reset_index(name='num of flights')


df_new

Out[49]:
           date  num of flights
0    2013-01-01               5
1    2013-01-02              10
2    2013-01-03              10
3    2013-01-04               6
4    2013-01-05               3
...         ...             ...
354  2013-12-27               2
355  2013-12-28               1
356  2013-12-29              19
357  2013-12-30              14
358  2013-12-31              16

359 rows × 2 columns
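The result has 359 rows rather than 365 because filtering to cancelled flights before grouping drops every day that had no cancellations. A sketch of an alternative, assuming a toy frame with made-up values: summing the boolean `cancelled` column per day on the unfiltered data keeps those days with a count of 0.

```python
import pandas as pd

# Toy data (illustrative): 2013-01-02 has flights but no cancellations
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-03"]),
    "cancelled": [True, False, False, True],
})

# Filter-then-count, as in the notebook: days without cancellations disappear
only_cancelled_days = toy[toy["cancelled"]].groupby("date")["cancelled"].count()

# Sum of booleans on the full frame: True counts as 1, so every day is kept
all_days = toy.groupby("date")["cancelled"].sum()

print(only_cancelled_days)
print(all_days)
```

Keeping the zero-cancellation days matters for any later per-day average or correlation, since leaving them out biases the counts upward.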


In [53]: df_a = df.groupby(['date'])['arr_delay'].mean().reset_index(name='avg arr delay')


df_a

Out[53]:
           date  avg arr delay
0    2013-01-01      12.651023
1    2013-01-02      12.692888
2    2013-01-03       5.733333
3    2013-01-04      -1.932819
4    2013-01-05      -1.525802
...         ...            ...
360  2013-12-27      -0.148803
361  2013-12-28      -3.259533
362  2013-12-29      18.763825
363  2013-12-30      10.057712
364  2013-12-31       6.212121

365 rows × 2 columns


In [54]: df_d = df.groupby(['date'])['dep_delay'].mean().reset_index(name='avg dep delay')


df_d

Out[54]:
           date  avg dep delay
0    2013-01-01      11.548926
1    2013-01-02      13.858824
2    2013-01-03      10.987832
3    2013-01-04       8.951595
4    2013-01-05       5.732218
...         ...            ...
360  2013-12-27      10.937630
361  2013-12-28       7.981550
362  2013-12-29      22.309551
363  2013-12-30      10.698113
364  2013-12-31       6.996053

365 rows × 2 columns


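In [57]: # index each per-day frame by date, then join them column-wise
dfs = [df_new, df_a, df_d]
dfs = [frame.set_index('date') for frame in dfs]
dfs = dfs[0].join(dfs[1:])  # left join: keeps only the dates present in df_new
dfs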

Out[57]:
            num of flights  avg arr delay  avg dep delay
date
2013-01-01             5.0      12.651023      11.548926
2013-01-02            10.0      12.692888      13.858824
2013-01-03            10.0       5.733333      10.987832
2013-01-04             6.0      -1.932819       8.951595
2013-01-05             3.0      -1.525802       5.732218
...                    ...            ...            ...
2013-12-27             2.0      -0.148803      10.937630
2013-12-28             1.0      -3.259533       7.981550
2013-12-29            19.0      18.763825      22.309551
2013-12-30            14.0      10.057712      10.698113
2013-12-31            16.0       6.212121       6.996053

359 rows × 3 columns

In [45]: import matplotlib.pyplot as plt


import seaborn as sns


In [58]: sns.scatterplot(data=dfs,x='num of flights',y='avg dep delay')

Out[58]: <AxesSubplot:xlabel='num of flights', ylabel='avg dep delay'>

In [59]: sns.scatterplot(data=dfs,x='num of flights',y='avg arr delay')

Out[59]: <AxesSubplot:xlabel='num of flights', ylabel='avg arr delay'>
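The scatter plots give a visual impression of the relationship. One way to put a number on it is the Pearson correlation between the daily cancellation count and the daily average delay. The sketch below uses only the first five days shown in the joined output above, so it illustrates the call and is not a conclusion about the full year; the name `daily` is hypothetical.

```python
import pandas as pd

# First five rows of the joined per-day table (values copied from the output above)
daily = pd.DataFrame({
    "num of flights": [5, 10, 10, 6, 3],
    "avg dep delay": [11.548926, 13.858824, 10.987832, 8.951595, 5.732218],
})

# Series.corr defaults to Pearson correlation; the result lies in [-1, 1]
r = daily["num of flights"].corr(daily["avg dep delay"])
print(r)
```

On the full table, `dfs.corr()` would return the pairwise correlations of all three columns at once.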


In [61]: sns.scatterplot(data=dfs,x='date',y='num of flights')

Out[61]: <AxesSubplot:xlabel='date', ylabel='num of flights'>

In [63]: sns.lineplot(data=dfs,x='date',y='num of flights')

Out[63]: <AxesSubplot:xlabel='date', ylabel='num of flights'>
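A daily line plot can be noisy, so one way to look for a pattern is to aggregate the daily counts by month or by weekday. A minimal sketch, assuming a date-indexed series of daily cancellation counts; the four values are copied from the per-day output earlier in the notebook, and the names `daily`, `per_month` and `per_weekday` are hypothetical:

```python
import pandas as pd

# Four daily cancellation counts taken from the per-day table above
daily = pd.Series(
    [5, 10, 19, 14],
    index=pd.to_datetime(["2013-01-01", "2013-01-02", "2013-12-29", "2013-12-30"]),
    name="num of flights",
)

# Total cancellations per calendar month (1 = January, 12 = December)
per_month = daily.groupby(daily.index.month).sum()

# Average cancellations per weekday name
per_weekday = daily.groupby(daily.index.day_name()).mean()

print(per_month)
print(per_weekday)
```

With the full year, `per_month` and `per_weekday` could each be drawn as a bar chart to check for seasonal or weekly patterns.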
