
Big Data Analysis in R

R for Biologists Class


3/7/2017

Big Data in R Example Using USA Commercial Flight Data from 1997 to 2008

The files and scripts associated with this tutorial are available in HCCGo as the R Advanced Tutorial.
We are going to look at US flight data from 1997 to 2008 and see if we can determine some possible causes
for flight delays. Our first step is to acquire and prepare our data for our analysis.

Data Download and Cleaning

Preprocessing of the data was done using Linux shell commands. For those with experience with Linux shell
scripting, you can find the script here: https://git.unl.edu/snippets/24. For everyone else, the steps done in
this preprocessing are:
1. Create a data folder in which to put all data files
2. Download individual csv files for each year from http://www.stat-computing.org
3. Unzip each individual csv file
4. Concatenate all individual csv files into a single airline.csv file, being careful to only include the
header from the first data file
5. Randomly subsample 100,000 data points from each individual year and concatenate them into
airline_subsample.csv
6. Select only the Year, Month, TailNum, ArrDelay and Origin columns from airline_subsample.csv.
This is the file that we will use for our analysis
7. Remove extraneous unescaped characters (’) from the data to prevent downstream errors
8. Create a truncated version of the data, airline_trunc.csv, containing only 1000 data points to use for
testing the script offline
9. Clean up by deleting the individual year csv files
To run this script on your own, you will want to load the tutorial and submit the get_data job.
If you compare the sizes of the airline.csv, airline_subsample.csv and airline_trunc.csv files, you
will see that the size of the data file was reduced by 99% just by removing fields that we will not be using
and by subsampling a fixed number of data points from each year. Deciding which aspects of your data
are important to your analysis and cleaning out any extra information is an important component of data
analysis and can greatly speed up your workflow.
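To get a feel for why steps 5 and 6 shrink the data so much, the same two operations can be sketched in R on a small made-up data frame. This is only an illustration: the real preprocessing was done with shell commands, and every value and the extra columns (Distance, Dest) below are invented for the example.

```r
# Toy stand-in for one year of flight data (all values are made up)
set.seed(1)
toy <- data.frame(
  Year     = rep(2000, 1000),
  Month    = sample(1:12, 1000, replace = TRUE),
  TailNum  = sample(c("N101", "N202", "N303"), 1000, replace = TRUE),
  ArrDelay = round(rnorm(1000, mean = 5, sd = 20)),
  Origin   = sample(c("OMA", "LNK", "DEN"), 1000, replace = TRUE),
  Distance = sample(100:2500, 1000, replace = TRUE),  # hypothetical unused field
  Dest     = sample(c("ORD", "DFW", "ATL"), 1000, replace = TRUE),  # hypothetical unused field
  stringsAsFactors = FALSE
)

# Analogue of step 5: randomly keep 100 of the 1000 rows
keep <- sample(nrow(toy), 100)
toy_sub <- toy[keep, ]

# Analogue of step 6: keep only the columns used in the analysis
toy_sub <- toy_sub[, c("Year", "Month", "TailNum", "ArrDelay", "Origin")]

# Compare the in-memory size before and after
print(object.size(toy), units = "Kb")
print(object.size(toy_sub), units = "Kb")
```

The exact sizes depend on your system, but the subsampled object should be a small fraction of the original.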
Now that we have finished preprocessing our data, we can move into R to begin our analysis.
If you wish to build and test your script in RStudio, download the airline_trunc.csv file and place it into
your project folder.

Load Required Packages


As with any analysis, begin by loading any relevant packages. Putting these commands at the top of your
script is good programming practice as it allows for portability of code. Anyone who looks at your code can
tell immediately what packages are required and can ensure they are loaded before running the script.

As you can see below, we will be using the dplyr, ggplot2, and maps packages.
# LARGE data example - using airline.csv

# load libraries
if(!require(dplyr)) { install.packages("dplyr"); library(dplyr) }
if(!require(ggplot2)) { install.packages("ggplot2"); library(ggplot2) }
if(!require(maps)) { install.packages("maps"); library(maps) }

If you have previously used library() to load packages, the above code might look a bit strange to you.
These if statements check whether a package is installed by using the require() command. require() loads
the package and returns TRUE if it succeeds, or returns FALSE (with a warning) if the package is not
installed, so !require() is TRUE only when the package is missing. In that case the install.packages()
command executes and installs the package, and the library() call that follows loads it. Using this instead
of a bare library() call helps ensure that your code is portable, meaning you can share the code with
someone else and it will be less likely to error out.
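The return value of require() can be checked directly. The sketch below uses the stats package, which ships with R, and a deliberately made-up package name ("notARealPackage123", assumed not to exist on your system) to show both outcomes.

```r
# require() returns TRUE when the package loads successfully
has_stats <- require(stats)

# "notARealPackage123" is a hypothetical name chosen so that loading fails;
# require() then returns FALSE and issues a warning instead of stopping
has_fake <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)

print(has_stats)
print(has_fake)
```

By contrast, library() stops with an error when the package is missing, which is why the if statements rely on require().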

Load the Data


Now we load our data into R using read.csv. If you are building your script offline in RStudio, use
airline_trunc.csv as your filename while testing. Be sure to replace it later when you upload the script to
run on the full datafile.
# Load the subsampled airline data
flights <- read.csv("./data/airline_subsample.csv",
                    sep=",",
                    header=TRUE,
                    stringsAsFactors=FALSE)

Notice that we are using the stringsAsFactors flag as FALSE because we want R to preserve character data
columns as characters and not automatically convert them to factors. If needed, we can convert specific
columns back to factors by using the factor() command as you will see later on.
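The effect of stringsAsFactors can be seen on a few lines of inline CSV text. This is a minimal sketch with invented tail numbers and airport codes, not rows from the airline files.

```r
# A tiny CSV supplied as text; the values are made up for illustration
csv_text <- "TailNum,Origin\nN101,OMA\nN202,LNK\nN101,DEN"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$TailNum)  # "character"
class(as_fct$TailNum)  # "factor"

# Convert a single column to a factor only where it is needed
as_chr$Origin <- factor(as_chr$Origin)
levels(as_chr$Origin)
```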

Custom Functions
The first thing we want to look at is whether the age of the aircraft influences flight delays. To do this, we
need to determine how old each airplane is at the time of the flight. If we had access to the date each airplane
was built, this would be easy. However, our data only contains Year and Month values for the date of each
flight. So we are going to estimate the birthmonth by determining the Month and Year of the first flight
recorded for each aircraft. Since we are going to be performing this calculation multiple times, it is good
practice to create a custom function that does the calculation for us.
# Function to estimate birthmonth of aircraft by finding the month and year of first flight
birthmonth <- function(y){
  minYear <- min(y[,'Year'], na.rm=TRUE)
  these <- which(y[,'Year']==minYear)
  minMonth <- min(y[these, 'Month'], na.rm=TRUE)
  return(12 * minYear + minMonth - 1)
}

To declare the function, we use the function() command followed by the commands we want R to execute
when our function is called. The y inside the parentheses tells R how we are going to refer to the data being
sent to the function. This function determines the minimum value of Year in the data, then selects only
the flights from that year. Then the minimum month is determined from that subset. The function then
returns the birthmonth as a single count of months, 12 * minYear + minMonth - 1, so that dates can be
compared and subtracted as plain numbers.
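A quick way to check a function like this is to call it on a tiny data frame where the answer can be worked out by hand. The three flights below are invented; the earliest one is in March 1998.

```r
# Same function as in the tutorial, repeated so this example runs on its own
birthmonth <- function(y){
  minYear <- min(y[,'Year'], na.rm=TRUE)
  these <- which(y[,'Year']==minYear)
  minMonth <- min(y[these, 'Month'], na.rm=TRUE)
  return(12 * minYear + minMonth - 1)
}

# Made-up flights for a single aircraft
one_plane <- data.frame(Year = c(1999, 1998, 1998), Month = c(1, 7, 3))

birthmonth(one_plane)  # 12 * 1998 + 3 - 1 = 23978
```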

Clean Data
Even though we did some cleaning during our preprocessing, we need to do more now that we have loaded
the data in R. Removing incomplete or missing data (indicated by NA) is easier and more efficient in R than
with the shell commands we used previously. Since we are identifying individual planes by their TailNum, we
need to remove any entries that are missing that information. If you are still testing your script in RStudio,
you can use the nrow() command before and after the following code to get an idea of how much of the data
is being removed.
# Remove flights with no tail number recorded
flights <- flights[!is.na(flights$TailNum),]
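The nrow() check mentioned above might look like the following sketch. The data frame is a made-up stand-in with two missing tail numbers, so the counts are easy to verify.

```r
# Toy data: two of the five rows have no tail number (values are invented)
toy <- data.frame(
  TailNum  = c("N101", NA, "N202", NA, "N303"),
  ArrDelay = c(5, 12, -3, 40, 0),
  stringsAsFactors = FALSE
)

before <- nrow(toy)
toy <- toy[!is.na(toy$TailNum), ]
after <- nrow(toy)

cat("Removed", before - after, "of", before, "rows\n")
```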

Calculate birthmonth for Each Aircraft


Now we want to go through our data and calculate the birthmonth for each individual aircraft. To do that,
we create a new vector entitled aircrafts, and fill it by selecting only unique entries of flights$TailNum
using the unique() command.
# Create vectors for each aircraft (aircrafts) and store their birthmonth (acStart)
aircrafts <- unique(flights[,'TailNum'])
aircrafts <- aircrafts[!is.na(aircrafts)]

Then we will create a vector entitled acStart which will hold the birthmonth associated with each
entry in our aircrafts vector. To do this, we create a vector the same length as the aircrafts vector
and fill it with 0’s. Then we assign the names from aircrafts to acStart so we can easily associate
the data.
acStart <- rep(0, length(aircrafts))
names(acStart) <- aircrafts

Now that we have a place to store our birthmonth values, we can iterate (cycle through) each individual
aircraft and calculate the birthmonth. Notice how we created a full-length vector to store this data rather than
adding each value and “growing” our vector through our loop? This is good practice, because R does not
handle growing vectors well. They eat resources as R has to copy all of the values each time to create a new
temporary object. Therefore, it is always better to create a vector (matrix, array, data frame, etc.) of the
length that you will need in advance and fill in values as you calculate them.
for (i in aircrafts) {
  acStart[i] <- birthmonth(flights[flights$TailNum==i,])
}
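The difference between growing a vector and filling one created in advance can be sketched as follows. The loop body (squaring a number) is an arbitrary stand-in for the birthmonth calculation; both approaches give the same result, and only how the storage is handled differs.

```r
n <- 10000

# Growing: c() builds a new, longer vector on every iteration
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: the vector already has its final length, so values are filled in place
filled <- numeric(n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

# Both approaches produce the same values
identical(grown, filled)
```

Wrapping each loop in system.time() is one way to compare them on your own machine; the gap tends to widen as n increases.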

The last thing we want to do is use the birthmonth to calculate the age of the aircraft for each flight. First, we
create a data frame which contains each aircraft's TailNum and its calculated birthmonth. Then using left_join()
from the dplyr package, we can add the acStart column to our flights data frame. Notice that we are joining
by TailNum. This means that each value of acStart will be added to every row which has the same TailNum.
So even though our age data frame and our flights data frame are different lengths, we can merge the data
based on this common field. Lastly, we want to use this start date to calculate the age of the airplane at the
time of each flight. We can do this using the mutate() function from dplyr. We will create a new column Age,
in months, as a calculation of Year, Month and the aircraft's birthmonth acStart.
# Calculate flight age using the birthmonth
age <- data.frame(names(acStart), acStart, stringsAsFactors = FALSE)
colnames(age) <- c("TailNum", "acStart")
flights <- left_join(flights, age, by="TailNum")
flights <- mutate(flights, Age = (Year * 12) + Month - acStart)
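The join and the Age arithmetic can be followed by hand on a few invented rows. This sketch assumes dplyr is installed; the tail numbers and dates are made up. Note that the formula gives an age of 1 for an aircraft's first recorded month.

```r
library(dplyr)

# Made-up flights: N101 first appears in March 1998, N202 in June 2000
flights_toy <- data.frame(
  TailNum = c("N101", "N101", "N202"),
  Year    = c(1998, 1999, 2000),
  Month   = c(3, 1, 6),
  stringsAsFactors = FALSE
)

# Birthmonths written out with the same formula used by birthmonth()
age_toy <- data.frame(
  TailNum = c("N101", "N202"),
  acStart = c(12 * 1998 + 3 - 1, 12 * 2000 + 6 - 1),
  stringsAsFactors = FALSE
)

# Every flight row receives the acStart value for its TailNum
flights_toy <- left_join(flights_toy, age_toy, by = "TailNum")
flights_toy <- mutate(flights_toy, Age = (Year * 12) + Month - acStart)

flights_toy$Age  # 1 11 1
```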

Statistical Analysis of ArrDelay as a function of Age and Year

Now that we have estimated the age of the aircraft for each flight, we can perform our statistical analysis. To do
this, we will fit a linear model to our data.
# Generate linear model for response: ArrDelay and predictors: Age and Year
delay_model <- lm(ArrDelay ~ Age + Year, data=flights)
summary(delay_model)

Once your script finishes executing, you can look at the output of summary(delay_model) by viewing the Rout file.
Does it appear that delay is associated with the age of the aircraft?
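To see which parts of the summary answer that question, it can help to fit the same kind of model on simulated data first. Everything below is invented (the data, the effect size, and the name fit), so the numbers say nothing about real flights; the point is only where the estimates and p-values are found.

```r
set.seed(42)
n <- 500

# Simulated predictors and a response with a small built-in Age effect
sim <- data.frame(
  Age  = sample(1:120, n, replace = TRUE),
  Year = sample(1997:2008, n, replace = TRUE)
)
sim$ArrDelay <- 2 + 0.05 * sim$Age + rnorm(n, sd = 15)

fit <- lm(ArrDelay ~ Age + Year, data = sim)

# The coefficient table holds one row per term: estimate, standard error, t value, p-value
coefs <- coef(summary(fit))
print(coefs)

# p-value for the Age term
coefs["Age", "Pr(>|t|)"]
```

In the Rout file, the same table appears under "Coefficients" in the printed summary.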

Weather
Poor weather conditions can cause flight delays. So if we look at delays by month, perhaps we will
see a change in delays over winter months as opposed to summer ones. To analyze this, let’s create a plot of
delays separated by each month.
As with any analysis, we need to prepare our data. First we will convert our month entries to a factor and
assign names to the values instead of the numbers we were using. This will make our plot easier to read.
# Convert Months from number to factor
flights$Month <- factor(flights$Month, levels = 1:12, labels = month.abb)
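When testing on a small file, some months may be missing from the data, so it is worth checking that month names land on the right numbers. One way to make the mapping explicit, sketched here on an invented vector that contains only three distinct months, is to give factor() both the full set of levels and their labels.

```r
# Made-up month numbers; only January, March and December occur
months_seen <- c(1, 3, 3, 12)

# Listing all twelve levels ties each number to its own abbreviation
as_factor <- factor(months_seen, levels = 1:12, labels = month.abb)

as.character(as_factor)  # "Jan" "Mar" "Mar" "Dec"
nlevels(as_factor)       # 12
```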

Next, we are going to select only the fields we need: Month and ArrDelay. Since we are no longer interested
in the year or the tail numbers, we can use select() to pick out only the fields we need. This will make
the commands run quicker. When dealing with large data, you need to be aware of resources and use as few
as possible to avoid overwhelming the system you are working on.
# Select a subset of fields needed to graph arrival delays by month
subset_month <- select(flights, Month, ArrDelay)

Now that we have completed preparing our data, we can create our plot. Since we have so many data points
we want to choose a plot format that can best display our data with enough detail to be informative, but not
enough that it is busy or overwhelming to look at. So something like a scatter plot might not be the best use
here because the sheer number of points can overwhelm the graph. Let’s be fancy and create a violin plot.
If you are not familiar with violin plots, they are a kind of merging between a boxplot and a distribution
curve. Wider areas of the plot indicate higher frequency in the data at a particular value, and skinnier areas
indicate lower frequency of data points in that region.
# Create violin graph showing arrival delays by month
ggplot(subset_month, aes(Month,ArrDelay, fill=factor(Month))) +
  geom_violin(aes(group=Month)) +
  theme(legend.position="none") +
  labs(y = "Arrival Delay (in minutes)") +
  labs(title = "Average Flight Arrival Delay by Month")

ggsave("ave_delay_by_month.jpg", width=9, height=6)

Notice that since we are going to run our script in a command line environment, we will not be able to view
our plot interactively like we do when using RStudio. Therefore, we use the ggsave() command to save our
plot as a jpg file for viewing after our script executes. If you are unsure of what each line in our ggplot()
command does, try removing or commenting them out individually to see how your plot changes (be sure to
watch the + signs to ensure they are included between the lines that are actually executing and not on the
last one).

State of Origin

The last aspect we want to look at is whether the airport is responsible for flight delays. Anyone who flies
frequently can tell you that some airports are much busier and appear to have more frequent flight delays
than others. Let’s see how true that might be by creating a plot that displays the average delays by state.
To do this, we want to load an additional file which associates the airport codes found in our data under
Origin with the state in which each airport is located.
# Load state list for airport codes and join departure state to flights dataframe
airport_codes <- read.csv("airport_codes.csv",
                          col.names=c("OriginState", "Origin"),
                          stringsAsFactors = FALSE)
flights <- left_join(flights, airport_codes, by="Origin")

We are specifying col.names here because we want to ensure that the airport codes column is named the
same as the column in our data frame. If the columns have different names, we will not be able to easily join
the two data frames using left_join().
After loading the csv file airport_codes.csv, which contains each airport code and the state in which the
airport is located, we use the dplyr command left_join(), which adds the OriginState column to
our data frame by matching values in the Origin columns from both data frames.
As before, we want to select only the columns of interest and clean the data by removing any values that do
not have ArrDelay values. Then we will use the group_by() command to group our data by OriginState
and calculate an AveDelay value for each state.
# Create subset of data containing origin state and arrival delay
subset_state <- select(flights, OriginState, ArrDelay)
subset_state <- subset_state[!is.na(subset_state$ArrDelay),]
subset_state <- group_by(subset_state, OriginState)
subset_summary <- summarise(subset_state, AveDelay=mean(ArrDelay))
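The same four steps can be traced on a handful of invented rows, where the group means are easy to verify by hand. This sketch assumes dplyr is installed; the lowercase state names are placeholders, since the exact format used in airport_codes.csv is not shown here.

```r
library(dplyr)

# Made-up delays for two states; one delay is missing
toy <- data.frame(
  OriginState = c("nebraska", "nebraska", "colorado", "colorado", "colorado"),
  ArrDelay    = c(10, NA, 0, 6, 30),
  stringsAsFactors = FALSE
)

# Drop the row with a missing delay, then average within each state
toy <- toy[!is.na(toy$ArrDelay), ]
toy <- group_by(toy, OriginState)
toy_summary <- summarise(toy, AveDelay = mean(ArrDelay))

print(toy_summary)  # colorado: (0 + 6 + 30) / 3 = 12, nebraska: 10
```

Removing the NA rows first matters because mean() returns NA for any group that contains a missing value unless na.rm=TRUE is set.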

We could simply look at subset_summary to see what the average delay is for each state. But let’s have more
fun with it by creating a plot of the US where each state is colored based on its average flight delay.
To do this, we are going to use map_data() from ggplot2, which draws on the maps package, to load a state
map of the US. If you load the packages in RStudio, you can see that the map data is stored as a data frame of
vertices which specify the outline of each individual state.
# Create graphic of US States colored by average delay time
map <- map_data("state")

Next, we create a ggplot to fill in each shape with a color associated with the AveDelay value.
ggplot(subset_summary, aes(fill=AveDelay)) +
  geom_map(aes(map_id=OriginState), map=map) +
  scale_fill_distiller(name = "Average Delay (mins)", palette = "Spectral", direction=-1) +
  expand_limits(x=map$long, y=map$lat) +
  theme_void() +
  labs(title = "Average Flight Arrival Delay by State")

ggsave("ave_delay_by_state.jpg", width=9, height=5)

That is a lot of code, so let’s break it down and discuss what each command is doing here:
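• ggplot(subset_summary, aes(fill=AveDelay)) tells R that we want to create a ggplot using the
data subset_summary and we are specifying that we want the aesthetic fill to be determined by
AveDelay.
• geom_map(aes(map_id=OriginState), map=map) draws the state outlines stored in map and matches
each one to a row of subset_summary by comparing OriginState with the region names in the map data.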
• scale_fill_distiller(name = "Average Delay (mins)", palette = "Spectral", direction=-1)
specifies the name of our legend (“Average Delay (mins)”), the type of color palette we want to use
(“Spectral”) and direction=-1 specifies that we want to reverse the coloration of the states (the
highest value is the darkest color). You can play around with different color palettes by changing the
value here. To see a list of available palettes, type ?scale_fill_distiller to view the help file.
• expand_limits(x=map$long, y=map$lat) stretches the x and y axes to cover every longitude and
latitude in our map data. geom_map() does not set the axis ranges on its own, so without this the
states could fall outside the plotting area.
• theme_void() assigns the void theme to our map, which removes the grids and axes.
• labs(title = "Average Flight Arrival Delay by State") gives our plot a nice title.
Once again, we used ggsave() to save our plot in a jpg file so we can view it once execution is finished.
Now that your script is done, upload the file to the cluster and submit it using the R_analysis job.