
Predicting Airfare on New Routes

BSF Case Analysis Submission

Team: 10
Group – 1

Anirudh G. (2201006)
Shimona Francis (2201218)
Arfa Aliya (2201086)
Radhika Rastogi (2201050)
Navya Sree Naragani (2202122)
Gautam M (2201102)

Table of Contents
Case Overview
Preliminary Data Examination in R Studio
Exploratory Data Analysis and Insights from EDA
  Insight-1
  Insight-2
  Insight-3
Identifying Best-fit Simple Regression equation
  Visualizing the best-fit
Determining the Best-Fit Multiple Regression equation
  Interpretation of the multiple regression results
Conclusions and Recommendations
Potential factors to model improvement

List of Figures
Figure 1- Summary statistics of the Airfares data
Figure 2- Distribution of the airfares (frequency)
Figure 3- Variation in Airfare distribution w.r.t presence of SW airlines
Figure 4- Scatter plot of Fares vs. Distance covered in route(s)
Figure 5- Regression output of Fit-1
Figure 6- Regression output of Fit-2
Figure 7- Regression output of Fit-3
Figure 8- Identifying the Best fit equation
Figure 9- Scatter plot- Best fit of Simple Regression equation
Figure 10- Multiple Regression equation (Fit4)
Figure 11- Multiple Regression equation (Fit5)

Case Overview
In the late 1990s, many major US cities faced airport congestion, partly as a result of the
airline deregulation of 1978. Southwest Airlines rose to prominence as a low-cost carrier, both
competing on existing routes and establishing new nonstop service. Converting retired military
bases or smaller airports into regional or commercial airports was considered as a way to
alleviate the congestion. An aviation consulting firm sought advisory contracts to assist the
many stakeholders involved. To support these services, it required predictive models, including
models that forecast fares on routes served by new airports. The firm examined real-world data
collected from Q3-1996 to Q2-1997, covering variables such as route characteristics, market
concentration, income, population, congestion measures, distance, passengers, and average
fares. The influence of Southwest Airlines' presence on fares was a significant component of
the investigation.

Preliminary Data Examination in R Studio


The .csv file of the Airfares data was loaded into R Studio through the script window, and the
data was examined for its structure, column names, and arrangement in rows and columns, along
with the summary statistics of all the variables involved. As mentioned in the case, R
identified 638 observations of 18 variables, including city codes, fare and coupons, new
carriers, route characteristics, demographic and passenger data, and distance covered. The
summary statistics give a preliminary view of the data through the mean, median, range, and
quartiles of each variable. For example, the average income in the starting and ending cities
is around $27,700, the maximum distance covered on a route is 2,764 miles, and the median fare
is about $144.6 per trip.
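
The inspection steps described above can be sketched in R roughly as follows. This is a minimal illustration, not the team's actual script: the data frame `toy`, its values, and the temporary file path are made up so that the example runs on its own, and only a handful of the 18 columns are imitated.

```r
# Sketch only: a tiny synthetic stand-in for the Airfares .csv file,
# written to a temporary file so the example is self-contained.
set.seed(1)
toy <- data.frame(
  S_INCOME = round(runif(20, 15000, 38000)),   # made-up values
  E_INCOME = round(runif(20, 15000, 38000)),
  DISTANCE = round(runif(20, 100, 2700)),
  SW       = sample(c("Yes", "No"), 20, replace = TRUE),
  FARE     = round(runif(20, 40, 400), 2)
)
csv_path <- tempfile(fileext = ".csv")         # hypothetical path
write.csv(toy, csv_path, row.names = FALSE)

# Load the file and examine it, as described in the text.
airfares <- read.csv(csv_path, stringsAsFactors = TRUE)

dim(airfares)      # number of observations and variables
names(airfares)    # column headers
str(airfares)      # type of each column
summary(airfares)  # mean, median, range and quartiles per variable
```

With the real file, `dim()` would be expected to report the 638 observations and 18 variables mentioned above.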

Figure 1- Summary statistics of the Airfares data

Exploratory Data Analysis and Insights from EDA


EDA mostly involves generating descriptive statistics and data visualizations to establish
correlations and relationships among the variables. Insights are derived by plotting the data
and observing trends and directions.
The main visualizations were a histogram of the fare distribution, a box plot of fares with and
without Southwest Airlines on the route, and a scatter plot of fare against the distance
covered on the route. The observations and results are as follows:

Insight-1:
The distribution of fares is skewed to the right: high fares occur relatively less often than
medium and low fares.

Figure 2- Distribution of the airfares (frequency)

Insight-2:
The box plot of fares by presence of Southwest Airlines was examined for any noticeable
difference in the fare ranges. As is evident from Figure 3, there is a substantial difference
in the fare slabs with and without Southwest Airlines. With SW present, apart from a few
outliers, the majority of fares fall within a $125 slab, whereas on routes without SW the slab
starts only at about $130 and ranges up to $240. This suggests that the presence of SW might
have a significant impact on airfares.

Figure 3- Variation in Airfare distribution w.r.t presence of SW airlines

Insight-3:
The scatter plot of fares vs. distance covered on the routes (Figure 4) shows a clear trend for
short-distance routes. As distance increases, the points scatter more widely and the
relationship is no longer strictly proportional, indicating only a partial direct correlation
overall: shorter distances go with lower fares, whereas longer distances do not map as clearly
to higher fares.

These insights are a few illustrations of what was attempted. Other variables and their
relationships can be explored in R Studio to gain more insight and to investigate patterns in
the data. The distributions of fare w.r.t distance and SW presence were relatively more
significant than the other available choices.

Figure 4- Scatter plot of Fares vs. Distance covered in route(s)
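
The three visualizations discussed in this section (histogram, box plot, scatter plot) could be produced along the following lines. This is a sketch on simulated data, assuming column names FARE, DISTANCE and SW; the numbers used to generate the data are invented for illustration and are not estimates from the case.

```r
# Sketch only: simulated data standing in for the Airfares file.
set.seed(2)
n <- 200
distance <- runif(n, 100, 2700)
sw <- factor(sample(c("No", "Yes"), n, replace = TRUE))
# Made-up generating process: fare rises with distance, falls with SW.
fare <- 60 + 0.08 * distance - 45 * (sw == "Yes") + rnorm(n, sd = 35)
airfares <- data.frame(FARE = fare, DISTANCE = distance, SW = sw)

# Insight-1: distribution of fares.
hist(airfares$FARE, main = "Distribution of fares", xlab = "FARE")

# Insight-2: fares with and without Southwest on the route.
boxplot(FARE ~ SW, data = airfares,
        main = "Fare by SW presence", xlab = "SW", ylab = "FARE")

# Insight-3: fare against distance.
plot(airfares$DISTANCE, airfares$FARE,
     main = "Fare vs. distance", xlab = "DISTANCE", ylab = "FARE")

# Numeric companion to the box plot: median fare per SW group.
medians <- tapply(airfares$FARE, airfares$SW, median)
medians
```

Comparing the group medians alongside the box plot gives a quick numerical check on the visual impression.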

Identifying Best-fit Simple Regression equation
Goodness-of-fit measures such as the coefficient of determination (R-squared) and adjusted
R-squared were examined to decide which of the three estimated simple regression equations
gives the best fit. These indicators evaluate how well each regression equation fits the data
loaded from the .csv file.

Three regression analyses were coded and run, one per equation, to compare their goodness-of-
fit measures and choose the one with the greatest R-squared or adjusted R-squared value, which
indicates a better match to the data. A simple regression uses a single predictor variable to
estimate its relationship with the response. The three fits considered were 'Fare-Distance',
'Fare-Number of New Carriers', and 'Fare-HI value (Herfindahl Index)'. The summary and plot
functions were used to visualize the regression models and derive insights from them. The
results for each equation mainly comprise the following diagnostic plots:
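
As an illustration of this comparison, the sketch below fits three single-predictor models and picks the one with the highest R-squared. The data are simulated (only DISTANCE is given a real effect on FARE), and the column names DISTANCE, NEW and HI are assumed to match the dataset, so the numbers it prints are not the report's results.

```r
# Sketch only: simulated data; in the simulation only DISTANCE drives FARE.
set.seed(3)
n <- 200
airfares <- data.frame(
  DISTANCE = runif(n, 100, 2700),
  NEW      = sample(0:3, n, replace = TRUE),
  HI       = runif(n, 1200, 10000)
)
airfares$FARE <- 50 + 0.08 * airfares$DISTANCE + rnorm(n, sd = 40)

# One simple regression per candidate predictor.
fit1 <- lm(FARE ~ DISTANCE, data = airfares)
fit2 <- lm(FARE ~ NEW, data = airfares)
fit3 <- lm(FARE ~ HI, data = airfares)

# Collect R-squared for each fit and pick the largest.
r2 <- sapply(list(fit1 = fit1, fit2 = fit2, fit3 = fit3),
             function(f) summary(f)$r.squared)
r2
best <- names(which.max(r2))
best
```

Because each model has exactly one predictor, ranking by R-squared and by adjusted R-squared gives the same order here.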

Residuals vs Fitted: This chart aids in evaluating the regression model's linearity assumption.
It plots the residuals against the fitted (predicted) values. A random scatter of points around
the horizontal zero line supports the linearity assumption. Any discernible pattern or trend
indicates a violation of linearity.

Q-Q Residuals (standardised residuals vs theoretical quantiles): The Q-Q (Quantile-Quantile)
plot compares the distribution of the standardised residuals to a theoretical normal
distribution. If the points fall along a straight line, the residuals are approximately
normally distributed, which is ideal for regression analysis. Deviations from the straight line
indicate deviations from normality.

Scale-Location (square root of standardised residuals vs fitted values): The Scale-Location
plot aids in evaluating the constant-variance assumption for the residuals. The points should
ideally be randomly distributed along a horizontal line, demonstrating consistent variance. A
funnel-shaped or fan-shaped pattern indicates heteroscedasticity, which occurs when the
variance of the residuals changes across the range of fitted values.

Residuals vs Leverage (standardised residuals vs leverage): This graph is used to identify
outliers or influential observations. It plots the standardised residuals against the leverage
of each observation. Points beyond the Cook's distance contours, or with high leverage values,
may have a significant impact on the regression model. They could be influential observations
that have a disproportionate effect on the model's parameter estimates.
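
The four diagnostic plots described above are what R's plot() method draws by default for a fitted lm object. A minimal sketch on simulated data (the variables `x` and `y` are placeholders, not the case data):

```r
# Sketch only: simulated predictor and response.
set.seed(4)
x <- runif(150, 100, 2700)
y <- 50 + 0.08 * x + rnorm(150, sd = 40)
fit <- lm(y ~ x)

# Draw the four default diagnostics in a 2 x 2 grid:
# Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# The quantities behind the last plot can also be inspected directly.
cooks <- cooks.distance(fit)   # influence of each observation
lev   <- hatvalues(fit)        # leverage of each observation
head(sort(cooks, decreasing = TRUE), 3)
```

Sorting the Cook's distances is a simple way to list the observations that the Residuals vs Leverage plot would flag.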

Figure 5- Regression output of Fit-1

Figure 6- Regression output of Fit-2

Figure 7- Regression output of Fit-3

Fit-1 more or less satisfies the linearity condition until the distance covered increases, where
fares stop rising in proportion. All three fits fall along a straight line in the Q-Q plots,
indicating the desirable normal distribution of residuals. The Scale-Location plots indicate
heteroscedasticity, with the variance of the residuals differing across the range of predicted
values. The Residuals vs Leverage plots flag the outliers w.r.t Cook's distance (or high
leverage values) that may have a substantial impact on the fit. Fit3 and Fit2 show relatively
less leverage than Fit1.

Visualizing the best-fit


The equation with the highest R-squared or adjusted R-squared value is the best fit to the data. A
scatter plot can be generated to visualize the relationship between the predictor variable and the
response variable once the best-fit equation has been established.

Fit1 is the best-fit equation among the simple regression models. A brief assessment of its
significance with respect to the data follows.

The best-fit equation, FARE = Beta0 + Beta1 * DISTANCE, models the fare for a given flight as a
linear function of the distance between the starting and terminating airports. According to the
equation, the fare tends to rise as the distance rises. This distance-to-fare association is
statistically significant and has a considerable impact on predicting fare values. The
coefficient Beta1 gives the projected fare increase per unit (mile) increase in distance. A
positive Beta1 means there is a positive linear relationship between distance and fare,
implying that longer flights have higher fares. The R-squared value of the best-fit equation
indicates how well the equation fits the data: it is the fraction of the variance in fare that
the distance variable alone explains.

Figure 8- Identifying the Best fit equation

It is crucial to highlight, however, that this interpretation focuses solely on the
relationship between distance and fare, ignoring other potential factors that may influence
airfares, such as airline rivalry, market circumstances, and demand-supply dynamics. As a
result, while this best-fit equation provides a simple understanding of the distance-fare
relationship, additional variables should be considered for a deeper analysis.

Figure 9- Scatter plot- Best fit of Simple Regression equation
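
A scatter plot with the fitted line, and a prediction from FARE = Beta0 + Beta1 * DISTANCE, might be produced as in the sketch below. The data are simulated and the 1,000-mile route is a hypothetical example; the estimated coefficients are therefore not those of Fit1 in the report.

```r
# Sketch only: simulated data standing in for the Airfares file.
set.seed(5)
airfares <- data.frame(DISTANCE = runif(150, 100, 2700))
airfares$FARE <- 50 + 0.08 * airfares$DISTANCE + rnorm(150, sd = 40)

fit1 <- lm(FARE ~ DISTANCE, data = airfares)

# Scatter plot of the data with the fitted regression line on top.
plot(FARE ~ DISTANCE, data = airfares,
     main = "Best fit of simple regression")
abline(fit1, col = "red", lwd = 2)

# Beta0 (intercept) and Beta1 (slope per mile).
beta <- coef(fit1)
beta

# Predicted fare for a hypothetical new 1,000-mile route.
new_route <- data.frame(DISTANCE = 1000)
pred <- unname(predict(fit1, newdata = new_route))
pred
```

The value returned by predict() is the same as evaluating Beta0 + Beta1 * 1000 by hand with the coefficients in `beta`.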

Determining the Best-Fit Multiple Regression equation
Multiple goodness-of-fit measures were compared: the coefficient of determination (R-squared)
and adjusted R-squared (as in simple regression), along with the F-statistic and the p-values
of the coefficients. These measures evaluate how well the regression equations fit the data and
whether the included predictors are statistically significant.

The lm() function was run in the R script window to estimate two multiple regression equations,
'Fit4' and 'Fit5'. Fit5 regresses FARE on COUPON + NEW + VACATION + SW + HI + SLOT + GATE +
DISTANCE + PAX (coupons, new carriers, vacation route, Southwest presence, Herfindahl index,
slot control, gate constraints, distance, and passengers). Fit4 uses all the predictors: the
Fit5 set plus S_INCOME, E_INCOME, S_POP, and E_POP. The summary and plot functions were used to
visualize the regression models and derive insights from them.
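
The difference between an "all predictors" model and a reduced model can be sketched with lm() as follows. The data frame is simulated with only five predictors, and the names `toy_full` and `toy_reduced` are illustrative, so this is not a reproduction of Fit4 and Fit5.

```r
# Sketch only: simulated data with a few of the predictors named in the text.
set.seed(6)
n <- 300
airfares <- data.frame(
  VACATION = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  SW       = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  DISTANCE = runif(n, 100, 2700),
  S_INCOME = runif(n, 15000, 38000),
  PAX      = runif(n, 1500, 70000)
)
# Made-up generating process for the response.
airfares$FARE <- 120 - 40 * (airfares$VACATION == "Yes") -
  45 * (airfares$SW == "Yes") + 0.08 * airfares$DISTANCE + rnorm(n, sd = 35)

# "FARE ~ ." uses every other column as a predictor.
toy_full    <- lm(FARE ~ ., data = airfares)
# A reduced model names its predictors explicitly.
toy_reduced <- lm(FARE ~ VACATION + SW + DISTANCE + PAX, data = airfares)

summary(toy_full)$adj.r.squared
summary(toy_reduced)$adj.r.squared

# Factor predictors are dummy-coded, which is why the coefficient tables
# show row names such as VACATIONYes and SWYes.
names(coef(toy_reduced))
```

The dummy coding explains the row labels in the coefficient output: each "Yes" coefficient is the fare difference relative to the "No" baseline.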

Figure 10- Multiple Regression equation (Fit4)

Figure 11- Multiple Regression equation (Fit5)

Interpretation of the multiple regression results


Results are:
> summary(fit4)$r.squared
[1] 0.7867705
> summary(fit5)$r.squared
[1] 0.7656081
>
> # Compare adjusted R-squared values
> summary(fit4)$adj.r.squared
[1] 0.7823282
> summary(fit5)$adj.r.squared
[1] 0.7622489
>
> # Compare F-statistics and p-values
> summary(fit4)$fstatistic
   value    numdf    dendf
177.1096  13.0000 624.0000
> summary(fit5)$fstatistic
   value    numdf    dendf
227.9192   9.0000 628.0000
>
> # Compare coefficients and p-values
> summary(fit4)$coefficients
                 Estimate   Std. Error     t value     Pr(>|t|)
(Intercept)  1.269941e+01 2.737943e+01   0.4638304 6.429310e-01
COUPON       3.754886e+00 1.219407e+01   0.3079271 7.582406e-01
NEW         -2.395530e+00 1.875424e+00  -1.2773275 2.019617e-01
VACATIONYes -3.564444e+01 3.617050e+00  -9.8545606 2.167018e-21
SWYes       -4.096960e+01 3.743729e+00 -10.9435262 1.281160e-25
HI           8.425789e-03 9.900663e-04   8.5103283 1.289852e-16
S_INCOME     1.206678e-03 5.171071e-04   2.3335163 1.993775e-02
E_INCOME     1.374273e-03 3.749187e-04   3.6655231 2.678492e-04
S_POP        3.400946e-06 6.523493e-07   5.2133825 2.524752e-07
E_POP        4.363124e-06 7.546959e-07   5.7813015 1.171808e-08
SLOTFree    -1.624477e+01 3.846880e+00  -4.2228428 2.771703e-05
GATEFree    -2.057923e+01 4.001584e+00  -5.1427704 3.628879e-07
DISTANCE     7.498426e-02 3.579549e-03  20.9479631 3.449869e-74
PAX         -8.709429e-04 1.459072e-04  -5.9691543 4.001410e-09
> summary(fit5)$coefficients
                 Estimate   Std. Error      t value     Pr(>|t|)
(Intercept)  1.236005e+02 1.767196e+01   6.99415618 6.852359e-12
COUPON       4.826680e-01 1.244335e+01   0.03878922 9.690708e-01
NEW         -2.158421e+00 1.954702e+00  -1.10421978 2.699207e-01
VACATIONYes -4.617948e+01 3.437611e+00 -13.43359749 2.400741e-36
SWYes       -4.786679e+01 3.663881e+00 -13.06450659 1.120519e-34
HI           8.293724e-03 1.003679e-03   8.26332518 8.425190e-16
SLOTFree    -2.778110e+01 3.689434e+00  -7.52990841 1.773135e-13
GATEFree    -2.781183e+01 4.035645e+00  -6.89154645 1.345540e-11
DISTANCE     8.084848e-02 3.610193e-03  22.39450393 4.262026e-82
PAX         -3.097691e-04 1.326554e-04  -2.33514235 1.985004e-02

These results can be interpreted as follows (with reference to the BS course):

Comparing R-squared and adjusted R-squared values- The R-squared value for 'fit4' is
0.7867705, indicating that the independent variables in the model explain about 78.7% of the
variance in the dependent variable (fare). The R-squared value for 'fit5' is 0.7656081,
suggesting that the independent variables in that model explain about 76.6% of the variance in
the dependent variable. The adjusted R-squared values (0.7823282 and 0.7622489) are slightly
lower than the R-squared values because they take the number of predictors in the model into
consideration. On both measures 'fit4' fits slightly better.
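
The reported adjusted R-squared values can be checked by hand from the R-squared values using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). In the sketch below, n = 638 observations comes from the data description, and p = 13 and p = 9 are the predictor counts shown in the F-statistic output; the helper name `adj_r2` is made up for this illustration.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 638                             # observations in the Airfares data
adj4 <- adj_r2(0.7867705, n, 13)     # model with 13 predictors
adj5 <- adj_r2(0.7656081, n, 9)      # model with 9 predictors

round(c(fit4 = adj4, fit5 = adj5), 7)
```

Both values agree with the console output to the reported precision, which confirms that the adjustment only penalizes the extra predictors.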

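Comparing F-statistics- The F-statistic assesses the model's overall significance, i.e.,
whether the predictors jointly explain more variance than an intercept-only model. The
F-statistic is 177.1096 for 'fit4' (13 and 624 degrees of freedom) and 227.9192 for 'fit5' (9
and 628 degrees of freedom), so both models are highly significant overall. The lower
F-statistic for 'fit4' reflects its four additional predictors rather than a weaker fit, since
its R-squared and adjusted R-squared are both higher than those of 'fit5'.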
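
The F-statistics in the output can be reproduced from R-squared with F = (R^2 / p) / ((1 - R^2) / (n - p - 1)), and pf() gives the corresponding overall p-value. A small sketch using only the reported numbers (the helper name `f_from_r2` is made up):

```r
# Overall F-statistic from R-squared, sample size n and number of predictors p.
f_from_r2 <- function(r2, n, p) (r2 / p) / ((1 - r2) / (n - p - 1))

n  <- 638
f4 <- f_from_r2(0.7867705, n, 13)    # 13 predictors, 624 residual df
f5 <- f_from_r2(0.7656081, n, 9)     # 9 predictors, 628 residual df
c(f4 = f4, f5 = f5)

# Upper-tail p-values of the overall F-tests.
p4 <- pf(f4, 13, n - 13 - 1, lower.tail = FALSE)
p5 <- pf(f5, 9,  n - 9 - 1,  lower.tail = FALSE)
c(p4 = p4, p5 = p5)
```

Because the numerator divides R-squared by the number of predictors, a model with fewer predictors can show a larger F even when its R-squared is lower, so F-statistics with different degrees of freedom are not a direct ranking of fit.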

Comparing coefficients and p-values- The coefficients estimate the relationship between each
independent variable and the dependent variable (fare). The sign (positive or negative) gives
the direction of the relationship, and the magnitude gives the change in fare per unit change
in the predictor.
The variables in 'fit4' that have statistically significant coefficients at the 0.05
significance level (p-value < 0.05) are VACATION, SW, HI, S_INCOME, E_INCOME, S_POP, E_POP,
SLOT, GATE, DISTANCE, and PAX. These factors have a considerable impact on the fare. COUPON and
NEW are not significant.
In 'fit5', the variables with statistically significant coefficients at the 0.05 significance
level are VACATION, SW, HI, SLOT, GATE, DISTANCE, and PAX. COUPON and NEW are again not
significant.
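
The significance screening can be done mechanically by filtering the Pr(>|t|) column at the 0.05 level. The sketch below types in the p-values from the two coefficient tables above (intercepts omitted); the vector names `p_fit4` and `p_fit5` are made up for this illustration.

```r
# p-values copied from the summary(fit4)$coefficients output.
p_fit4 <- c(COUPON = 7.582406e-01, NEW = 2.019617e-01,
            VACATIONYes = 2.167018e-21, SWYes = 1.281160e-25,
            HI = 1.289852e-16, S_INCOME = 1.993775e-02,
            E_INCOME = 2.678492e-04, S_POP = 2.524752e-07,
            E_POP = 1.171808e-08, SLOTFree = 2.771703e-05,
            GATEFree = 3.628879e-07, DISTANCE = 3.449869e-74,
            PAX = 4.001410e-09)

# p-values copied from the summary(fit5)$coefficients output.
p_fit5 <- c(COUPON = 9.690708e-01, NEW = 2.699207e-01,
            VACATIONYes = 2.400741e-36, SWYes = 1.120519e-34,
            HI = 8.425190e-16, SLOTFree = 1.773135e-13,
            GATEFree = 1.345540e-11, DISTANCE = 4.262026e-82,
            PAX = 1.985004e-02)

alpha <- 0.05
sig4    <- names(p_fit4)[p_fit4 < alpha]    # significant in fit4
sig5    <- names(p_fit5)[p_fit5 < alpha]    # significant in fit5
notsig4 <- names(p_fit4)[p_fit4 >= alpha]
notsig5 <- names(p_fit5)[p_fit5 >= alpha]
sig4; sig5; notsig4; notsig5
```

With a fitted model object at hand, the same filter can be applied directly to the fourth column of summary(fit)$coefficients.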

Conclusions and Recommendations


Based on the above analyses, the final conclusions are:

- Simple Regression: The best-fit equation in the simple regression model (Fit1, Figure 5)
shows that distance and fare have a positive linear relationship. The analysis provides a
simplified view of the link between distance and fare, but other potential factors that may
influence airfares should be considered.
- Multiple Regression: In addition to distance, the multiple regression analysis includes
many factors, resulting in a more thorough model for forecasting fares. In 'fit4' the distance
coefficient suggests that the fare increases by approximately $0.075 for each additional mile,
holding the other predictors constant, and the R-squared value of 0.7867705 implies that the
predictors together explain approximately 78.7% of the variation in fare. VACATION (vacation
routes) is a significant predictor with a statistically significant impact on pricing, as are
SW, HI, SLOT, GATE, DISTANCE, and PAX. These predictors' coefficients show the size and
direction of their influence on fare. The adjusted R-squared values (0.7823282 for 'fit4' and
0.7622489 for 'fit5') indicate that the predictors in each model explain about 78.2% and 76.2%
of the variation in fare, respectively.
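
To show how such a model would forecast a fare on a new route, the sketch below plugs a route description into the 'fit5' coefficients reported above. The route values (1,000 miles, 10,000 passengers, and so on) are entirely hypothetical, and the helper `predict_fare` is made up for this illustration; only the coefficients come from the output.

```r
# Coefficients copied from the summary(fit5)$coefficients output.
b <- c("(Intercept)" = 123.6005, COUPON = 0.482668, NEW = -2.158421,
       VACATIONYes = -46.17948, SWYes = -47.86679, HI = 0.008293724,
       SLOTFree = -27.78110, GATEFree = -27.81183,
       DISTANCE = 0.08084848, PAX = -0.0003097691)

# Linear prediction: intercept plus coefficient times value for each predictor.
predict_fare <- function(route) {
  x <- c(1, route[names(b)[-1]])   # 1 for the intercept, then predictors
  sum(b * x)
}

# Hypothetical new route; dummy variables are 1 for "Yes"/"Free", else 0.
route_no_sw <- c(COUPON = 1.2, NEW = 3, VACATIONYes = 0, SWYes = 0,
                 HI = 4000, SLOTFree = 1, GATEFree = 1,
                 DISTANCE = 1000, PAX = 10000)
route_sw <- replace(route_no_sw, "SWYes", 1)   # same route with Southwest

fare_no_sw <- predict_fare(route_no_sw)
fare_sw    <- predict_fare(route_sw)
c(without_SW = fare_no_sw, with_SW = fare_sw,
  difference = fare_sw - fare_no_sw)
```

Whatever route values are chosen, the difference between the two predictions equals the SWYes coefficient, i.e., about $47.87 lower with Southwest present, all else equal.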

A few recommendations based on these results:

- Improve vacation-based routes: Encourage airline competition by attracting new
carriers to enter the market on vacation routes. This can help to offset the disadvantages
of market concentration while also increasing revenue from interested niche clients.
- Collaborative strategy: Emphasize a collaborative strategy that includes extensive
collaboration with the players to understand their goals, obstacles, and unique
circumstances. Developing good relationships and actively engaging stakeholders in
decision-making will boost the likelihood of obtaining advisory contracts.
- Highlight Previous Successful Engagements and Case Studies: Highlight previous
successful engagements and case studies where the firm gave useful insights and assisted
clients in overcoming similar difficulties. Show the firm's ability to provide measurable
results and positive outcomes.
- Networking and Outreach: Attend industry conferences, events, and networking
opportunities to make connections and cultivate relationships with future clients.
Proactively reach out to important aviation decision-makers, such as airline executives,
airport authorities, and government officials, to display the firm's capabilities and discuss
future partnerships.

Potential factors to model improvement


The following three variables/factors are not present in the data but could potentially
improve the model:

- Airline Reputation: Including a variable that captures the reputation or customer
satisfaction ratings of the airlines operating on a route may provide insights into how
airline reputation affects pricing. Higher-rated airlines may be able to command higher
fares, whilst lower-rated airlines may be able to attract customers by offering lower fares.
- Economic statistics: Including economic statistics such as GDP growth rate, inflation
rate, or unemployment rate for the starting and terminating cities may aid in capturing
economic conditions and their impact on airfares. Economic variables can have an impact
on travel demand as well as passengers' willingness to pay for air travel.
- Flight Schedule Flexibility: Including a measure of flight schedule flexibility, such as
the number of flight alternatives or frequencies available on a route, may have an impact
on pricing. Increased competition and therefore reduced fares may result from more
flexible schedules with frequent flights.

Variables that could be added to the dataset at low cost include:

- Seasonality: By including a variable indicating the season or month of the year when the
flight was taken, the impact of seasonality on fares might be captured. Seasonal changes,
such as peak travel periods or vacation seasons, can have an impact on fare levels.
- Fuel Prices: Including average fuel prices during the data collection period as a variable
may aid in accounting for the impact of fuel costs on airfares. Fuel price fluctuations can
have an impact on airline operational costs and, as a result, fares.
