Team: 10
Group – 1
Anirudh G. (2201006)
Shimona Francis (2201218)
Arfa Aliya (2201086)
Radhika Rastogi (2201050)
Navya Sree Naragani (2202122)
Gautam M (2201102)
Table of Contents
Case Overview
Preliminary Data Examination in R Studio
Exploratory Data Analysis and Insights from EDA
  Insight-1:
  Insight-2:
  Insight-3:
Identifying Best-fit Simple Regression equation
  Visualizing the best-fit
Determining the Best-Fit Multiple Regression equation
  Interpretation of the multiple regression results
Conclusions and Recommendations
Potential factors to model improvement
List of Figures
Figure 1- Summary statistics of the Airfares data
Figure 2- Distribution of the airfares (frequency)
Figure 3- Variation in Airfare distribution w.r.t presence of SW airlines
Figure 4- Scatter plot of Fares vs. Distance covered in route(s)
Figure 5- Regression output of Fit-1
Figure 6- Regression output of Fit-2
Figure 7- Regression output of Fit-3
Figure 8- Identifying the Best fit equation
Figure 9- Scatter plot- Best fit of Simple Regression equation
Figure 10- Multiple Regression equation (Fit4)
Figure 11- Multiple Regression equation (Fit5)
Case Overview
In the late 1990s, partly as a result of the 1978 airline deregulation, major US cities experienced
airport congestion. Southwest Airlines rose to prominence as a low-cost carrier, competing on
existing routes while also establishing nonstop service. Converting retired military bases or
smaller airports into regional or commercial airports was considered as a way to alleviate
congestion. An aviation consulting firm sought advisory contracts to assist the many stakeholders
involved. It required predictive models to support its consulting services, which included
forecasting fares if new airports were built. The firm examined real-world data collected from
Q3-1996 to Q2-1997, covering variables such as route characteristics, market concentration,
income, population, congestion measures, distance, passengers, and average fares. The influence
of Southwest Airlines' presence on fares was a significant component of the investigation.
Preliminary Data Examination in R Studio
Figure 1- Summary statistics of the Airfares data
Exploratory Data Analysis and Insights from EDA
Insight-1:
The frequency distribution of the fares is skewed to the right: high fares occur less often than
medium and low fares.
Figure 2- Distribution of the airfares (frequency)
Insight-2:
A boxplot of fares by presence of Southwest Airlines (SW) was examined to identify any noticeable
difference in the fare ranges. As is evident from Figure 3, the fare ranges differ substantially
with and without Southwest. With SW present, apart from a few outliers, the majority of fares fall
within the band up to about $125, whereas on routes without SW the band starts only at about $130
and ranges up to about $240. This suggests that the presence of SW might have a significant impact
on the airfares.
Figure 3- Variation in Airfare distribution w.r.t presence of SW airlines
Insight-3:
The scatter plot of fares vs. distance covered in the routes shows a clear trend between the two
variables for short-distance routes, as is evident from Figure 4. At longer distances the points
scatter and the relationship is no longer strictly proportional, indicating only a partial positive
correlation overall: shorter distances go with lower fares, but longer distances do not clearly go
with higher fares.
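The three checks above can be reproduced with a few lines of base R. The sketch below is only an illustration on simulated data, since the Airfares file itself is not part of this report: the data frame `airfares_demo` and the numbers used to generate it are hypothetical, while the column names FARE, SW and DISTANCE follow the regression output shown later in the report.

```r
# Simulated stand-in for the Airfares data (hypothetical values, not the real data)
set.seed(1)
n <- 638
airfares_demo <- data.frame(
  DISTANCE = round(runif(n, 100, 2700)),
  SW       = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.7, 0.3)))
)
# Fares rise with distance, drop when SW is present, and carry multiplicative noise
airfares_demo$FARE <- with(airfares_demo,
  (60 + 0.07 * DISTANCE - 40 * (SW == "Yes")) * rlnorm(n, meanlog = 0, sdlog = 0.3))

# Insight-1: frequency distribution of fares (a mean above the median points to right skew)
hist(airfares_demo$FARE, main = "Distribution of fares", xlab = "FARE")
fare_mean   <- mean(airfares_demo$FARE)
fare_median <- median(airfares_demo$FARE)

# Insight-2: fares with and without Southwest on the route
boxplot(FARE ~ SW, data = airfares_demo, xlab = "SW present", ylab = "FARE")
median_by_sw <- tapply(airfares_demo$FARE, airfares_demo$SW, median)

# Insight-3: fares against route distance
plot(airfares_demo$DISTANCE, airfares_demo$FARE, xlab = "DISTANCE", ylab = "FARE")
fare_dist_cor <- cor(airfares_demo$DISTANCE, airfares_demo$FARE)
```

With the real data, `airfares_demo` would simply be replaced by the imported data frame; the numeric summaries (`median_by_sw`, `fare_dist_cor`) give a quick cross-check of what the plots suggest.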
Identifying Best-fit Simple Regression equation
Three simple regression equations were coded and run so that their goodness-of-fit measures could
be compared and the one with the greatest R-squared or adjusted R-squared value, which indicates a
better match to the data, could be chosen. A simple regression uses a single predictor variable to
estimate its relationship with the response. The three fits considered were ‘Fare-Distance’,
‘Fare-Number of New Carriers’, and ‘Fare-HI value (Herfindahl Index)’. The summary and plot
functions were used to visualize the regression models and derive insights from them. The
diagnostic output for each equation consists of four plots:
Residuals vs Fitted (residuals against fitted values): This chart aids in evaluating the
regression model's linearity assumption. It plots the residuals against the fitted (predicted)
values. The linearity assumption is supported by a random scatter of points around the horizontal
zero line. Any discernible pattern or trend indicates a violation of linearity.
Q-Q Residuals (standardised residuals against theoretical quantiles): The Q-Q (Quantile-
Quantile) plot compares the distribution of the standardised residuals to a theoretical normal
distribution. If the points fall along a straight line, the residuals are approximately normally
distributed, which is ideal for regression analysis. Deviations from the straight line indicate
deviations from normality.
Scale-Location (square root of the absolute standardised residuals against fitted values): The
Scale-Location plot aids in evaluating the residuals' constant variance assumption. The points
should ideally be randomly distributed along a horizontal line, demonstrating consistent
variance. If the points form a funnel-shaped or fan-shaped pattern, this indicates
heteroscedasticity, which occurs when the variance of the residuals varies across the range of
fitted values.
Residuals vs Leverage (standardised residuals against leverage): This graph is used to identify
outliers or influential observations. It plots the standardised residual of each observation
against its leverage. Points outside the Cook's distance contours or with high leverage values
may have a significant impact on the regression model. They could be influential observations
that have a disproportionate effect on the model's parameter estimates.
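In R, these four panels are what `plot()` draws by default for an `lm` object. The following is a minimal sketch on simulated data (the data frame `demo` and its generating coefficients are hypothetical), assuming the report's figures were produced in the same way. It also extracts the quantities behind the plots so they can be inspected numerically.

```r
# Simulated data for illustration only (not the Airfares data)
set.seed(42)
demo <- data.frame(DISTANCE = runif(200, 100, 2700))
demo$FARE <- 50 + 0.08 * demo$DISTANCE + rnorm(200, sd = 30)

fit_demo <- lm(FARE ~ DISTANCE, data = demo)

# Default plot.lm panels: Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(fit_demo)
par(op)

# Numeric counterparts of the plotted quantities
std_res  <- rstandard(fit_demo)       # standardised residuals (Q-Q and Scale-Location panels)
leverage <- hatvalues(fit_demo)       # leverage of each observation
cooks_d  <- cooks.distance(fit_demo)  # influence measure behind the Cook's distance contours

# Rule-of-thumb flag (an assumption, not the report's criterion): Cook's distance above 4/n
influential <- which(cooks_d > 4 / nrow(demo))
```

The 4/n cutoff is one common convention; other thresholds for Cook's distance are equally defensible, so the flagged rows should be treated as candidates for inspection rather than as confirmed outliers.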
Figure 5- Regression output of Fit-1
Figure 6- Regression output of Fit-2
Figure 7- Regression output of Fit-3
Fit-1 more or less satisfies the linearity condition until the distance covered increases and fares
stop rising in proportion. In the Q-Q plots, all three fits fall along a straight line, indicating
the desired normal distribution of the residuals. All three regression equations show
heteroscedasticity, with the variance of the residuals differing across the range of predicted
values. The leverage plots mark the outliers with respect to Cook's distance (or high leverage
values) that may have a substantial impact on the fit. Fit2 and Fit3 show less leverage than Fit1.
Fit1 is the best-fit equation among the simple regression models. A brief assessment of its
significance with respect to the data follows.
The best-fit equation, FARE = Beta0 + Beta1 * DISTANCE, relates the fare for a given flight to the
distance between the origin and destination airports. According to the equation, the fare tends to
rise as the distance rises. This distance-to-fare association is statistically significant and has
a considerable impact on predicting fare values. The coefficient Beta1 is the projected fare
increase per unit increase in distance. A positive Beta1 means there is a positive linear
relationship between distance and fare, implying that longer flights have higher fares. The
R-squared value of the best-fit equation indicates how well the equation fits the data: a higher
R-squared value means that the distance variable alone explains a greater fraction of the variance
in fare.
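The selection among the three candidate equations can be sketched as follows. The data are simulated (the data frame `routes` and the coefficients used to generate it are hypothetical), and the column names NEW and HI for the number of new carriers and the Herfindahl index are taken from the coefficient output later in the report. The fit with the largest R-squared is picked as the best fit.

```r
# Simulated data for illustration only (not the Airfares data)
set.seed(7)
n <- 300
routes <- data.frame(
  DISTANCE = runif(n, 100, 2700),
  NEW      = sample(0:3, n, replace = TRUE),
  HI       = runif(n, 1500, 10000)
)
routes$FARE <- 60 + 0.08 * routes$DISTANCE + 0.005 * routes$HI + rnorm(n, sd = 35)

# One simple regression per candidate predictor
fits <- list(
  Fit1 = lm(FARE ~ DISTANCE, data = routes),
  Fit2 = lm(FARE ~ NEW,      data = routes),
  Fit3 = lm(FARE ~ HI,       data = routes)
)

# Goodness-of-fit measures for each candidate
r2     <- sapply(fits, function(f) summary(f)$r.squared)
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
best   <- names(which.max(r2))

print(round(rbind(r2, adj_r2), 4))
print(best)
```

Because each candidate has exactly one predictor, ranking by R-squared and ranking by adjusted R-squared give the same order here; the distinction only matters once models with different numbers of predictors are compared.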
Determining the Best-Fit Multiple Regression equation
The multiple regression equations are compared using several goodness-of-fit measures: the
coefficient of determination (R-squared) and adjusted R-squared (as in simple regression), the
F-statistic, and the p-values of the coefficients. These measures evaluate how well the regression
equations fit the data and whether the predictors included are statistically significant.
The lm() function was run in the R script window to estimate two multiple regression equations,
‘Fit4’ and ‘Fit5’. Fit4 regresses FARE on all the variables, and Fit5 regresses FARE on
COUPONS + NEW_CARRIERS + VACATION_ROUTE + SW + HERFINDAHL_INDEX + SLOT_CONTROLLED +
GATE_CONSTRAINTS + DISTANCE + PASSENGERS (COUPON, NEW, VACATION, SW, HI, SLOT, GATE, DISTANCE,
and PAX in the output). The summary and plot functions were used to visualize the regression
models and derive insights from them.
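A sketch of the two `lm()` calls is given below. It assumes a data frame whose columns are named as in the coefficient output (COUPON, NEW, VACATION, SW, HI, S_INCOME, E_INCOME, S_POP, E_POP, SLOT, GATE, DISTANCE, PAX, FARE). The data frame `air` is simulated and all of its values are hypothetical, as are the factor level names "Controlled" and "Constrained"; only the level "Free" is confirmed by the output.

```r
# Simulated stand-in for the Airfares data (hypothetical values)
set.seed(123)
n <- 638
two_level <- function(levels, p) {
  factor(sample(levels, n, replace = TRUE, prob = c(1 - p, p)))
}
air <- data.frame(
  COUPON   = runif(n, 1, 1.9),
  NEW      = sample(0:3, n, replace = TRUE),
  VACATION = two_level(c("No", "Yes"), 0.25),
  SW       = two_level(c("No", "Yes"), 0.30),
  HI       = runif(n, 1500, 10000),
  S_INCOME = runif(n, 15000, 40000),
  E_INCOME = runif(n, 15000, 40000),
  S_POP    = runif(n, 3e4, 9e6),
  E_POP    = runif(n, 3e4, 9e6),
  SLOT     = two_level(c("Controlled", "Free"), 0.7),
  GATE     = two_level(c("Constrained", "Free"), 0.8),
  DISTANCE = runif(n, 100, 2700),
  PAX      = runif(n, 1500, 70000)
)
air$FARE <- 120 + 0.08 * air$DISTANCE - 45 * (air$SW == "Yes") -
  40 * (air$VACATION == "Yes") + rnorm(n, sd = 35)

# Fit4: FARE on every other column; Fit5: without the income and population variables
fit4 <- lm(FARE ~ ., data = air)
fit5 <- lm(FARE ~ COUPON + NEW + VACATION + SW + HI + SLOT + GATE + DISTANCE + PAX,
           data = air)

print(summary(fit5)$fstatistic)    # value, numdf, dendf
print(summary(fit4)$coefficients)  # Estimate, Std. Error, t value, Pr(>|t|)
print(c(fit4 = summary(fit4)$adj.r.squared, fit5 = summary(fit5)$adj.r.squared))
```

The `FARE ~ .` shorthand only matches the report's "all the variables" model if identifier columns (such as airport codes or city names) have been dropped from the data frame first; otherwise the predictors should be listed explicitly.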
Figure 11- Multiple Regression equation (Fit5)
   value    numdf    dendf
227.9192   9.0000 628.0000
>
> # Compare coefficients and p-values
> summary(fit4)$coefficients
                 Estimate   Std. Error     t value     Pr(>|t|)
(Intercept)  1.269941e+01 2.737943e+01   0.4638304 6.429310e-01
COUPON       3.754886e+00 1.219407e+01   0.3079271 7.582406e-01
NEW         -2.395530e+00 1.875424e+00  -1.2773275 2.019617e-01
VACATIONYes -3.564444e+01 3.617050e+00  -9.8545606 2.167018e-21
SWYes       -4.096960e+01 3.743729e+00 -10.9435262 1.281160e-25
HI           8.425789e-03 9.900663e-04   8.5103283 1.289852e-16
S_INCOME     1.206678e-03 5.171071e-04   2.3335163 1.993775e-02
E_INCOME     1.374273e-03 3.749187e-04   3.6655231 2.678492e-04
S_POP        3.400946e-06 6.523493e-07   5.2133825 2.524752e-07
E_POP        4.363124e-06 7.546959e-07   5.7813015 1.171808e-08
SLOTFree    -1.624477e+01 3.846880e+00  -4.2228428 2.771703e-05
GATEFree    -2.057923e+01 4.001584e+00  -5.1427704 3.628879e-07
DISTANCE     7.498426e-02 3.579549e-03  20.9479631 3.449869e-74
PAX         -8.709429e-04 1.459072e-04  -5.9691543 4.001410e-09
> summary(fit5)$coefficients
                 Estimate   Std. Error      t value     Pr(>|t|)
(Intercept)  1.236005e+02 1.767196e+01   6.99415618 6.852359e-12
COUPON       4.826680e-01 1.244335e+01   0.03878922 9.690708e-01
NEW         -2.158421e+00 1.954702e+00  -1.10421978 2.699207e-01
VACATIONYes -4.617948e+01 3.437611e+00 -13.43359749 2.400741e-36
SWYes       -4.786679e+01 3.663881e+00 -13.06450659 1.120519e-34
HI           8.293724e-03 1.003679e-03   8.26332518 8.425190e-16
SLOTFree    -2.778110e+01 3.689434e+00  -7.52990841 1.773135e-13
GATEFree    -2.781183e+01 4.035645e+00  -6.89154645 1.345540e-11
DISTANCE     8.084848e-02 3.610193e-03  22.39450393 4.262026e-82
PAX         -3.097691e-04 1.326554e-04  -2.33514235 1.985004e-02
Comparing R-squared and adjusted R-squared values- The R-squared value for 'fit4' is
0.7867705, indicating that the independent variables in the model explain about 78.7% of the
variance in the dependent variable (fare). The R-squared value for 'fit5' is 0.7656081, suggesting
that the independent variables in the model explain about 76.6% of the variance in the dependent
variable. The adjusted R-squared values are close to the R-squared values, but they take the
number of predictors in the model into consideration.
Comparing F-statistics- The F-statistic assesses the model's overall significance. The F-statistic
is 177.1096 for 'fit4' and 227.9192 for 'fit5', so both models are significant overall. The higher
F-statistic of 'fit5' reflects that it explains nearly as much of the variance with fewer
predictors (9 versus 13); it does not mean that 'fit5' explains more of the variance than 'fit4'.
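The reported statistics are tied together by two standard identities: F = (R²/k) / ((1 − R²)/(n − k − 1)) and adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of predictors and n the number of observations. The sketch below recomputes both from the reported R-squared values as a consistency check. It assumes that both models were fitted to the same n = 638 observations, a figure inferred from the degrees of freedom (9 and 628) printed with the F-statistic of the nine-predictor model; the labels `full` and `reduced` are introduced here for the 13-predictor and 9-predictor models.

```r
# R-squared values as reported in the text
r2 <- c(full = 0.7867705, reduced = 0.7656081)
k  <- c(full = 13, reduced = 9)   # predictors per model, counted from the coefficient tables
n  <- 628 + 9 + 1                 # dendf + numdf + 1 (assumes no rows were dropped)

# Overall F-statistic and adjusted R-squared implied by R-squared, k and n
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(f_stat, 3))
print(round(adj_r2, 6))
```

The recomputed values agree with the reported F-statistics (177.1096 and 227.9192) and adjusted R-squared values (0.7823282 and 0.7622489) up to rounding of the inputs. The larger F-statistic of the reduced model therefore comes from dividing a similar R-squared by fewer predictors.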
Comparing coefficients and p-values- The coefficients represent an estimate of the link
between each independent variable and the dependent variable (fare). The direction and strength
of the relationship are indicated by the sign (positive or negative) and magnitude of the
coefficients.
The variables in 'fit4' that have statistically significant coefficients at the 0.05 significance
level (p-value < 0.05) are VACATION, SW, HI, S_INCOME, E_INCOME, S_POP, E_POP, SLOT, GATE,
DISTANCE, and PAX. These factors have a considerable impact on the fare. COUPON and NEW are not
significant at this level.
In 'fit5', the variables with statistically significant coefficients at the 0.05 significance
level are VACATION, SW, HI, SLOT, GATE, DISTANCE, and PAX; COUPON and NEW are not significant.
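Significance at the 0.05 level can be read directly off the Pr(>|t|) column of the coefficient table. The sketch below applies that rule to the p-values printed for the nine-predictor model; the vector name `p_fit5` is introduced here for illustration. With a fitted model object, the same column is available as `summary(fit)$coefficients[, "Pr(>|t|)"]`.

```r
# p-values copied from the printed summary(fit5)$coefficients table (intercept omitted)
p_fit5 <- c(COUPON = 9.690708e-01, NEW = 2.699207e-01, VACATIONYes = 2.400741e-36,
            SWYes = 1.120519e-34, HI = 8.425190e-16, SLOTFree = 1.773135e-13,
            GATEFree = 1.345540e-11, DISTANCE = 4.262026e-82, PAX = 1.985004e-02)

alpha <- 0.05
significant     <- names(p_fit5)[p_fit5 < alpha]
not_significant <- names(p_fit5)[p_fit5 >= alpha]

print(significant)
print(not_significant)
```

Seven of the nine predictors pass the 0.05 threshold; only COUPON and NEW do not.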
Conclusions and Recommendations
- Simple Regression: The best-fit equation in the simple regression model reveals that
distance and fare have a positive linear relationship. The relationship holds when other
variables are controlled for: in 'fit4', the distance coefficient suggests that the fare
increases by approximately 0.0749 units for each unit increase in distance. The simple
regression provides a simplified view of the link between distance and fare, but other
potential factors that may influence airfares should be considered.
- Multiple Regression: In addition to distance, the multiple regression analysis includes
many other factors, resulting in a more thorough model for forecasting fares. VACATION
(vacation route), SW, HI, SLOT, GATE, DISTANCE, and PAX are significant predictors in both
models, and their coefficients show the size and direction of their influence on fare. The
R-squared values (0.7867705 for 'fit4' and 0.7656081 for 'fit5') and the adjusted
R-squared values (0.7823282 for 'fit4' and 0.7622489 for 'fit5') indicate that the
predictors in each model can explain about 78% and 76% of the variation in fare,
respectively.
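To illustrate how a fitted equation turns into a fare forecast, the sketch below plugs a route into the coefficients of the nine-predictor model. The coefficients are copied from the printed output; the route values (coupon 1.2, three new carriers, HI 4000, distance 1000, 12000 passengers) and the helper `predict_fare` are hypothetical and serve only to show the mechanics.

```r
# Coefficient estimates copied from the printed summary(fit5)$coefficients table
b <- c("(Intercept)" = 123.6005, COUPON = 0.482668, NEW = -2.158421,
       VACATIONYes = -46.17948, SWYes = -47.86679, HI = 0.008293724,
       SLOTFree = -27.78110, GATEFree = -27.81183,
       DISTANCE = 0.08084848, PAX = -0.0003097691)

# Dummy variables are 1 when the level applies (Yes / Free) and 0 otherwise
predict_fare <- function(coupon, new, vacation, sw, hi, slot_free, gate_free,
                         distance, pax) {
  x <- c(1, coupon, new, vacation, sw, hi, slot_free, gate_free, distance, pax)
  sum(b * x)
}

# Same hypothetical route, with and without Southwest present
with_sw    <- predict_fare(1.2, 3, 0, 1, 4000, 1, 1, 1000, 12000)
without_sw <- predict_fare(1.2, 3, 0, 0, 4000, 1, 1, 1000, 12000)

print(c(with_sw = with_sw, without_sw = without_sw,
        difference = without_sw - with_sw))
```

Because the model is linear, the difference between the two forecasts equals the magnitude of the SWYes coefficient regardless of the other route values. With a fitted model object, `predict(fit5, newdata = ...)` would do the same calculation and could also return prediction intervals.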
Potential factors to model improvement
Airline Reputation: Including a variable that captures the reputation or customer
satisfaction ratings of the airlines operating on a route may provide insights into how
airline reputation affects pricing. Higher-rated airlines may be able to command higher
fares, whilst lower-rated airlines may be able to attract customers by offering lower fares.
Economic statistics: Including economic statistics such as GDP growth rate, inflation
rate, or unemployment rate for the starting and terminating cities may aid in capturing
economic conditions and their impact on airfares. Economic variables can have an impact
on travel demand as well as passengers' willingness to pay for air travel.
Flight Schedule Flexibility: Including a measure of flight schedule flexibility, such as
the number of flight alternatives or frequencies available on a route, may have an impact
on pricing. Increased competition and therefore reduced rates may result from more
flexible schedules with frequent flights.
Variables that could be added to the data set at low cost include:
Seasonality: By including a variable indicating the season or month of the year when the
flight was taken, the impact of seasonality on fares might be captured. Seasonal changes,
such as peak travel periods or vacation seasons, can have an impact on fare levels.
Fuel Prices: Including average fuel prices during the data collection period as a variable
may aid in accounting for the impact of fuel costs on airfares. Fuel price fluctuations can
have an impact on airline operational costs and, as a result, fares.