Assignment 2 Solution

© All Rights Reserved


On any given day, more than 87,000 flights take place in the United States alone.

About one-third of these flights are commercial flights, operated by companies

like United, American Airlines, and JetBlue. While about 80% of commercial

flights take off and land as scheduled, the other 20% suffer from delays due to

various reasons. A certain number of delays are unavoidable, due to unexpected

events, but some delays could hopefully be avoided if the factors causing delays

were better understood and addressed.

In this problem, we'll use a dataset of 9,381 flights that occurred in June through

August of 2014 between the three busiest US airports -- Atlanta (ATL), Los

Angeles (LAX), and Chicago (ORD) -- to predict flight delays. The dataset

AirlineDelay.csv includes the following 23 variables:

2. Carrier = the carrier operating the flight (American Airlines, Delta Air Lines, etc.)
3. Month = the month of the flight (June, July, or August)
4. DayOfWeek = the day of the week of the flight (Monday, Tuesday, etc.)
5. NumPrevFlights = the number of previous flights taken by this aircraft in the same day
6. PrevFlightGap = the amount of time between when this flight's aircraft is scheduled to arrive at the airport and when it's scheduled to depart for this flight
7. HistoricallyLate = the proportion of time this flight has been late historically
8. InsufficientHistory = whether or not we have enough data to determine the historical record of the flight (equal to 1 if we don't have at least 3 records, equal to 0 if we do)
9. OriginInVolume = the amount of incoming traffic volume at the origin airport, normalized by the typical volume during the flight's time and day of the week
10. OriginOutVolume = the amount of outgoing traffic volume at the origin airport, normalized by the typical volume during the flight's time and day of the week
11. DestInVolume = the amount of incoming traffic volume at the destination airport, normalized by the typical volume during the flight's time and day of the week
12. DestOutVolume = the amount of outgoing traffic volume at the destination airport, normalized by the typical volume during the flight's time and day of the week
13. OriginPrecip = the amount of rain at the origin over the course of the day, in tenths of millimeters
14. OriginAvgWind = average daily wind speed at the origin, in miles per hour
15. OriginWindGust = fastest wind speed during the day at the origin, in miles per hour
16. OriginFog = whether or not there was fog at some point during the day at the origin (1 if there was, 0 if there wasn't)
17. OriginThunder = whether or not there was thunder at some point during the day at the origin (1 if there was, 0 if there wasn't)
18. DestPrecip = the amount of rain at the destination over the course of the day, in tenths of millimeters
19. DestAvgWind = average daily wind speed at the destination, in miles per hour
20. DestWindGust = fastest wind speed during the day at the destination, in miles per hour
21. DestFog = whether or not there was fog at some point during the day at the destination (1 if there was, 0 if there wasn't)
22. DestThunder = whether or not there was thunder at some point during the day at the destination (1 if there was, 0 if there wasn't)
23. TotalDelay = the amount of time the aircraft was delayed, in minutes (this is our dependent variable)

Load the dataset AirlineDelay.csv into R and call it "Airlines". Randomly split it

into a training set (70% of the data) and testing set (30% of the data) by running

the following lines in your R console:

set.seed(15071)

spl = sample(nrow(Airlines), 0.7*nrow(Airlines))

AirlinesTrain = Airlines[spl,]

AirlinesTest = Airlines[-spl,]

How many observations are in the training set AirlinesTrain?

Answer: 6566

How many observations are in the testing set AirlinesTest? (2 marks)

Answer: 2815
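
The two counts follow directly from the split sizes: 70% of 9,381 is 6,566.7, which is truncated to 6,566 training rows, leaving 2,815 test rows. A minimal sketch of that arithmetic, assuming a stand-in data frame with a made-up single column (the real AirlineDelay.csv is not used here) and an explicit floor() to make the truncation visible:

```r
# Stand-in data frame with the same number of rows as the assignment's dataset.
# The single column is invented purely for illustration.
set.seed(15071)
Airlines <- data.frame(TotalDelay = rpois(9381, lambda = 10))

# Draw 70% of the row indices without replacement; floor() makes the
# truncation of the fractional size (6566.7) explicit.
spl <- sample(nrow(Airlines), floor(0.7 * nrow(Airlines)))
AirlinesTrain <- Airlines[spl, , drop = FALSE]
AirlinesTest  <- Airlines[-spl, , drop = FALSE]

train_n <- nrow(AirlinesTrain)
test_n  <- nrow(AirlinesTest)
cat("train:", train_n, "test:", test_n, "\n")
```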

PROBLEM 2 - A LINEAR REGRESSION MODEL

Build a linear regression model to predict "TotalDelay" using all of the other

variables as independent variables. Use the training set to build the model.

What is the model's R-squared? (Please report the "Multiple R-squared" value in

the output.) (2 marks)

Answer: 0.09475
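
One way to fit such a model and read off the "Multiple R-squared" is sketched below. The data frame `toy` and its coefficients are invented for illustration; with the real data the call would presumably be lm(TotalDelay ~ ., data = AirlinesTrain).

```r
# Synthetic stand-in for the training set; the column names mirror the
# assignment, but the values and effect sizes are made up.
set.seed(1)
n <- 200
toy <- data.frame(
  HistoricallyLate = runif(n),
  NumPrevFlights   = sample(0:5, n, replace = TRUE),
  DayOfWeek        = factor(sample(c("Monday", "Tuesday", "Sunday"), n, replace = TRUE))
)
toy$TotalDelay <- 40 * toy$HistoricallyLate + 1.5 * toy$NumPrevFlights + rnorm(n, sd = 20)

# "TotalDelay ~ ." uses every other column as an independent variable.
model <- lm(TotalDelay ~ ., data = toy)

# summary() reports this value as "Multiple R-squared".
r2 <- summary(model)$r.squared
print(r2)
```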

PROBLEM 3 - CHECKING FOR SIGNIFICANCE

In your linear regression model, which of the independent variables are

significant at the p=0.05 level (at least one star)? For factor variables, consider

the variable significant if at least one level is significant. Write all that apply. (2

marks)
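
Rather than reading the stars off the printed summary, the p-values can be filtered programmatically. The sketch below uses an invented two-variable dataset (`x_signal`, `x_noise`) and is only meant to show the mechanics, not to reproduce the assignment's answer.

```r
# Synthetic data: y depends on x_signal but not on x_noise (both names are
# hypothetical and exist only for this illustration).
set.seed(2)
n <- 300
toy <- data.frame(x_signal = rnorm(n), x_noise = rnorm(n))
toy$y <- 3 * toy$x_signal + rnorm(n)

model <- lm(y ~ ., data = toy)

# The coefficient table has one row per term; the last column holds p-values.
coef_table <- coef(summary(model))
p_values <- coef_table[, "Pr(>|t|)"]

# Keep the terms with p < 0.05, leaving out the intercept.
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
print(significant)
```

For factor variables the table lists one row per level, so a variable counts as significant if any of its level rows passes the threshold.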

PROBLEM 4 - CORRELATIONS

What is the correlation between NumPrevFlights and PrevFlightGap in the

training set? (2 marks)

Answer: -0.652053189

What is the correlation between OriginAvgWind and OriginWindGust in the

training set? (2 marks)

Answer: 0.509953488

Hint: Find a function that calculates the correlation between variables.
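
In R the function the hint points to is most likely cor(). A small sketch with made-up wind data (the vectors `avg` and `gust` are illustrative, not taken from the dataset):

```r
# Invented wind speeds: the daily average is built to track the gust speed.
set.seed(3)
gust <- runif(100, 5, 40)
avg  <- 0.5 * gust + rnorm(100, sd = 3)

# Pearson correlation between two numeric vectors.
r <- cor(avg, gust)
print(r)

# Passing a data frame of numeric columns returns the full correlation matrix.
cm <- cor(data.frame(avg, gust))
print(cm)
```

With the real data the equivalent call would presumably be cor(AirlinesTrain$OriginAvgWind, AirlinesTrain$OriginWindGust).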

Why is it important to check for correlations between independent variables?

Select one of the following: (2 marks)

1. Having highly correlated independent variables in a regression model can affect the interpretation of the coefficients.

2. Having highly correlated independent variables in a regression model can affect the quality of the resulting predictions.

PROBLEM 6 - COEFFICIENTS

In the model with all of the available independent variables, what is the

coefficient for HistoricallyLate? (2 marks)

Answer: 47.913638

PROBLEM 7 - UNDERSTANDING THE COEFFICIENTS

The coefficient for NumPrevFlights is 1.56. What is the interpretation of this

coefficient? Choose one of the following: (2 marks)

flights increases by approximately 1.56.

For an increase of 1 in the number of previous flights, the prediction of the

total delay increases by approximately 1.56.

If the number of previous flights increases by 1, then the total delay will

definitely increase by approximately 1.56; the number of previous flights

should be minimized if airlines want to decrease the amount of delay.
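
The "change in prediction" reading of a coefficient can be checked numerically: for two flights that differ only by one previous flight, the gap between their predictions equals the fitted coefficient. A sketch on invented data (the effect sizes are arbitrary):

```r
# Synthetic training data with two of the assignment's variable names.
set.seed(4)
n <- 200
toy <- data.frame(NumPrevFlights   = sample(0:6, n, replace = TRUE),
                  HistoricallyLate = runif(n))
toy$TotalDelay <- 1.5 * toy$NumPrevFlights + 40 * toy$HistoricallyLate + rnorm(n, sd = 10)

model <- lm(TotalDelay ~ NumPrevFlights + HistoricallyLate, data = toy)

# Two hypothetical flights, identical except for one extra previous flight.
flight_a <- data.frame(NumPrevFlights = 2, HistoricallyLate = 0.3)
flight_b <- data.frame(NumPrevFlights = 3, HistoricallyLate = 0.3)

diff_pred <- unname(predict(model, flight_b) - predict(model, flight_a))
slope     <- unname(coef(model)["NumPrevFlights"])
cat("prediction gap:", diff_pred, "coefficient:", slope, "\n")
```

This is a statement about the model's predictions only; it does not establish that adding a flight causes the delay.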

Let us try to understand our model.

In the linear regression model, given two flights that are otherwise identical,

what is the absolute difference in predicted total delay given that one flight is on

Thursday and the other is on Sunday? (2 marks)

Answer: 6.989857

In the linear regression model, given two flights that are otherwise identical,

what is the absolute difference in predicted total delay given that one flight is on

Saturday and the other is on Sunday? (2 marks)

Answer: 0.911413
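
For a factor such as DayOfWeek, the difference between two otherwise identical flights is the difference between the two level coefficients, with the reference level counting as 0. The helper `level_effect` below is a hypothetical convenience function, and the data are synthetic:

```r
# Synthetic data with a four-level day factor; Thursday gets an extra delay.
set.seed(5)
days <- c("Sunday", "Monday", "Thursday", "Saturday")
n <- 400
toy <- data.frame(DayOfWeek        = factor(sample(days, n, replace = TRUE)),
                  HistoricallyLate = runif(n))
toy$TotalDelay <- 30 * toy$HistoricallyLate +
  ifelse(toy$DayOfWeek == "Thursday", 5, 0) + rnorm(n, sd = 8)

model <- lm(TotalDelay ~ ., data = toy)

# Coefficient of one factor level; the reference level has no row, so it is 0.
level_effect <- function(model, variable, level) {
  name <- paste0(variable, level)
  b <- coef(model)
  if (name %in% names(b)) unname(b[name]) else 0
}

gap <- abs(level_effect(model, "DayOfWeek", "Thursday") -
           level_effect(model, "DayOfWeek", "Sunday"))
print(gap)
```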

PROBLEM 9 - PREDICTIONS ON THE TEST SET

Make predictions on the test set using your linear regression model. What is the
Sum of Squared Errors (SSE) on the test set? Hint: Use the predict function to make
the predictions. You can then export your data to Excel with the write.csv function
and compute the SSE there, or compute it directly in R; the choice is yours.
(2 marks)

Answer: 4744764
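
Computing the SSE directly in R takes only a couple of lines. The sketch below uses a made-up one-variable dataset and a 70/30 split like the one in the assignment; with the real data the prediction step would presumably be predict(model, newdata = AirlinesTest).

```r
# Synthetic data and a 70/30 train/test split.
set.seed(6)
n <- 300
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)

idx   <- sample(n, floor(0.7 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Fit on the training rows only, then predict the held-out rows.
model <- lm(y ~ x, data = train)
pred  <- predict(model, newdata = test)

# SSE is the sum of squared differences between actual and predicted values.
sse  <- sum((test$y - pred)^2)
rmse <- sqrt(sse / nrow(test))
cat("SSE:", sse, "RMSE:", rmse, "\n")
```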

Let's turn this problem into a multi-class classification problem by creating a new
dependent variable. Our new dependent variable will take three different values:
"No Delay", "Minor Delay", and "Major Delay". Create this variable, called
DelayClass, by running the command below in your R console.

IMPORTANT: BEFORE YOU DO THIS STEP, SAVE YOUR ORIGINAL DATA FRAME. YOU MAY USE THE FOLLOWING CODE:

AirlinesOriginal = Airlines

Now you may proceed:

Airlines$DelayClass = factor(ifelse(Airlines$TotalDelay == 0, "No Delay", ifelse(Airlines$TotalDelay >= 30, "Major Delay", "Minor Delay")))

How many flights in the dataset Airlines had no delay? (2 marks)

Answer: 4688

How many flights in the dataset Airlines had a minor delay? (2 marks)

Answer: 3096

How many flights in the dataset Airlines had a major delay? (2 marks)

Answer: 1597

Now, remove the original dependent variable "TotalDelay" from your dataset with

the command: Airlines$TotalDelay = NULL
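
The nested ifelse assigns "No Delay" to a delay of exactly 0 minutes, "Major Delay" to 30 minutes or more, and "Minor Delay" to everything in between; table() then counts each class. A tiny worked example on a hand-made vector (`delay_minutes` is illustrative, not from the dataset):

```r
# Six hand-picked delays that exercise both boundaries (0 and 30 minutes).
delay_minutes <- c(0, 5, 29, 30, 45, 0)

# Same nested ifelse as in the assignment, applied to the toy vector.
delay_class <- factor(ifelse(delay_minutes == 0, "No Delay",
                      ifelse(delay_minutes >= 30, "Major Delay", "Minor Delay")))

# Count how many observations fall into each class.
counts <- table(delay_class)
print(counts)
```

On the real data frame, table(Airlines$DelayClass) would give the three counts asked for above.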

PROBLEM 12 - A CART MODEL

Build a CART model to predict "DelayClass" using all of the other variables as

independent variables and the training set to build the model. Remember that to

predict a multi-class dependent variable, you can use the rpart function in the

same way as for a binary classification problem. Just use the default parameter

settings.

How many splits are in the resulting tree? (2 marks)

Answer: 2

The CART model you just built splits first on which variable? (2 marks)

Answer: HistoricallyLate
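
Both answers can be read from the fitted tree object instead of a plot. The sketch below assumes the rpart package is installed and uses synthetic data in which the class depends only on HistoricallyLate; the column `Noise` is invented as a distractor.

```r
library(rpart)

# Synthetic three-class data: the class is determined by HistoricallyLate.
set.seed(7)
n <- 600
toy <- data.frame(HistoricallyLate = runif(n), Noise = rnorm(n))
toy$DelayClass <- factor(ifelse(toy$HistoricallyLate < 0.33, "No Delay",
                         ifelse(toy$HistoricallyLate < 0.66, "Minor Delay", "Major Delay")))

# Default parameters; method = "class" requests a classification tree.
tree <- rpart(DelayClass ~ ., data = toy, method = "class")

# tree$frame has one row per node; leaves are marked "<leaf>" in the var
# column, so the remaining rows are the splits. The first row is the root.
n_splits <- sum(tree$frame$var != "<leaf>")
root_var <- as.character(tree$frame$var[1])
cat("splits:", n_splits, "first split on:", root_var, "\n")
```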

PROBLEM 13 - LOGISTIC REGRESSION

Use the following command in your R console to create a dichotomous dependent variable:

AirlinesOriginal$DelayClass = factor(ifelse(AirlinesOriginal$TotalDelay == 0, "No Delay", "Delay"))

What is baseline accuracy? (2 marks)

Answer: 50%

What is the model accuracy? (2 marks)

Answer: 65%
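
Baseline accuracy is the share of the most frequent class, and model accuracy is the share of observations whose predicted class matches the actual one. The sketch below is an illustration on synthetic data; the 0.5 probability threshold and the choice to score on the fitting data are assumptions, since the assignment text does not state them.

```r
# Synthetic binary outcome whose probability rises with HistoricallyLate.
set.seed(8)
n <- 500
toy <- data.frame(HistoricallyLate = runif(n))
toy$Delayed <- rbinom(n, 1, plogis(-2 + 4 * toy$HistoricallyLate))

# Baseline: always predict the most common class.
baseline_acc <- max(table(toy$Delayed)) / n

# Logistic regression via glm with the binomial family.
model <- glm(Delayed ~ HistoricallyLate, data = toy, family = binomial)

# Predicted probabilities, converted to classes at an assumed 0.5 threshold.
prob      <- predict(model, type = "response")
model_acc <- mean((prob > 0.5) == (toy$Delayed == 1))
cat("baseline:", baseline_acc, "model:", model_acc, "\n")
```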

PROBLEM 14 - RANDOM FOREST

Run a random forest model and identify the most important variable using varImpPlot.

(5 Marks)

Answer: HistoricallyLate
