You are on page 1of 2

use hflights data set

(i) How many rows and columns are in table hfights?

cat(" Number of rows>> ", dim(a)[1], "\n Number of columns>> ", dim(a)[2])

(ii) How many different carriers are listed in the table (print a table with
distinct carrier names)?

a %>% distinct(UniqueCarrier)
a %>% distinct(UniqueCarrier) %>% nrow() #totally 15

(iii) Which and how many airports were involved? Consider both origin and
destination
airports!

a %>% distinct(Origin)
a %>% distinct(Dest)
union(unique(hflights$Origin), unique(hflights$Dest))

a %>% select(Origin, Dest) %>%


unique() %>% nrow()

(iv) How many flights were cancelled?

a %>% select(Cancelled) %>%


filter(Cancelled==1) %>% nrow()

(v) (A) Produce a table where statistics for each carrier is shown:
(a) number of flights per carrier
(b) total distance own in miles per carrier
(c) total actual elapsed time in hours per carrier
(d) total air time in hours per carrier
(e) mean distance per flight for each carrier
(f) mean actual elapsed time in hours per flight for each carrier
(g) mean air time in hours per flight for each carrier

statistics <- a %>%


group_by(UniqueCarrier) %>%
summarise(
number_of_flights = n(),
total_distance = sum(Distance),
total_AET = sum(ActualElapsedTime/60, na.rm = T),
total_airtime = sum(AirTime/60, na.rm = T),
mean_of_dist = mean(Distance),
mean_of_AET = mean((ActualElapsedTime/60), na.rm = T),
mean_of_airtime = mean((AirTime/60), na.rm = T)
)

(B) calculate the percentage of total distance flown by top 3 performing


carriers VS total distance
flown by remaining carriers. Execute steps:
(a) First rank carriers by total distance own
(b) top 3 performers are in one group, remaining carriers are in second
group
(c) for each group calculate total distance own
(d) for each group calculate %: total distance flown per group /total
distance all carriers
grouping <- statistics %>%
arrange(desc(total_distance)) %>%
mutate(rank = row_number(),
group = ifelse(rank <=3, 'top3', 'remaining')) %>%
group_by(group) %>%
summarise(
Total_distance_own = sum(total_distance),
Total_distance_percent =
total_distance/sum(total_distance)*100 )

(vi) Modify your main flights table:


(a) create date column by uniting columns: year, month, day of month
(b) when uniting columns do not lose source columns (mutate each column -
with slightly different name, before unite operation is executed)
(c) you will need to parse date column after unite operation
(d) also you should add leading zeros to month and day of month column before
date is created
(e) create columns: quarter, week

modify_table <- a %>%


mutate(DayofMonth = str_pad(DayofMonth, width=2, pad='0'),
Month = str_pad(Month, width=2, pad='0')) %>%
unite(Date, Year, Month, DayofMonth, sep = '-') %>%
mutate(Date = ymd(Date),
quater = quarter(Date),
week = week(Date))

Using your modified table try to answer the given questions:


(a) Is total number of flights increasing or decreasing quarterly?
(b) Is total distance increasing or decreasing monthly?
HINT: dplyr's function lag can assist you when calculating the quarterly or
monthly differences

(a) modify_table %>% group_by(quater) %>%


summarise(numberofflights = n()) %>%
ggplot(., aes(x=quater, y=numberofflights)) +geom_point()
#the first 3 quaters the flights were increasing and last quater it decreased

(b) modify_table %>% group_by(Date) %>%


summarise(total_distance = sum(Distance)) %>%
ggplot(., aes(x=Date, y=total_distance)) + geom_point()

#the distance was increasing till june month and decreased after june

You might also like