
Methodology

As mentioned in the background section, the main objectives of this research are to identify the factors that impact customer satisfaction in the airline industry and to determine how the satisfaction rate can be increased in the future. To answer these questions, the following steps have been carried out:

A. Data Collection
The data used in this research is the airline passenger satisfaction dataset from Kaggle. The dataset is based on a survey that investigates the different factors affecting customer satisfaction and can be found at the link. It consists of two files: a training file, used for both training and validation, and a testing file, used in the final testing phase.

“Fig 1: Sample of the data”

B. Data Cleaning
Before starting the analysis and modeling, the data should be clean and well-structured, so a set of sub-steps was performed to ensure the data is clean:
B.1 - Impute the missing values in any of the features: As can be seen in Fig 2, all the features are clean except the “Arrival.Delay.in.Minutes” feature, which contains 310 missing values. These missing values have been imputed with the mean value of that feature, and Fig 3 confirms that all the features are now free of missing values.
“Fig 2: The number of missing values in each feature before imputing”

“Fig 3: The number of missing values in each feature after imputing”
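
A minimal sketch of this imputation step (the full script is in the appendix), assuming the training file has already been read into a data frame called training:

library(imputeTS)
sapply(training, function(x) sum(is.na(x)))   # before imputation: Arrival.Delay.in.Minutes shows 310 NAs
training <- na_mean(training)                 # replace each NA with the mean of its column
sapply(training, function(x) sum(is.na(x)))   # after imputation: every count should now be 0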

B.2- Assert that none of the features contains any abnormal value by getting the unique values of each feature. As can be seen in the figure below, no abnormal value was found in any of the features.
“Fig 4: Assertion that there are no abnormal values in the features”
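
This check can be reproduced with a one-line sketch (also included in the appendix), assuming the data frame is called training:

sapply(training, function(y) unique(y))  # list the distinct values of every feature to spot abnormal entries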

B.3- Converting the categorical string features into dummy-coded numerical features. This step is done before modeling, as most machine learning models cannot work directly with string values. As can be seen in the figure below, all the categorical features have been converted into numerical features.

“Fig 5: Sample of the data after the conversion of the categorical features into numerical”
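
A minimal sketch of the recoding, shown here only for the satisfaction and Gender columns as an example (the appendix applies the same pattern to Customer.Type, Type.of.Travel, and Class):

training$satisfaction[training$satisfaction == 'neutral or dissatisfied'] <- 0
training$satisfaction[training$satisfaction == 'satisfied'] <- 1
training$satisfaction <- as.integer(training$satisfaction)
training$Gender[training$Gender == 'Female'] <- 0
training$Gender[training$Gender == 'Male'] <- 1
training$Gender <- as.integer(training$Gender)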

C. Exploratory Data Analysis (EDA)


Now the data is ready for the analysis that will help answer the research questions. The analysis performed is a multivariate analysis that examines the relationship between each factor (feature) and the dependent variable (the satisfaction feature), as well as the relationship between each pair of variables (the correlation matrix).
C.1- The first graph, in Fig 6, shows the relationship between customer age and customer satisfaction. As can be seen, the percentage of satisfied customers increases among customers aged 40-60 years, while customers below the age of 20 and those aged 60-70 tend to be dissatisfied. This shows that customer age is one of the factors that impact customer satisfaction, which is highlighted in the findings section.

“Fig 6: The relationship between age and customer satisfaction”
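
All of the satisfaction plots in Fig 6 to Fig 16 follow the same recipe, sketched below for the age variable; the remaining plots only change the x aesthetic (see the appendix for the full list):

library(ggplot2)
AgePlot <- ggplot(training) +
  geom_bar(mapping = aes(x = Age, fill = satisfaction), position = "fill", width = 0.8)  # stacked bars normalized to proportions
AgePlot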

C.2- As can be noticed from the figure below, most of the customers in business class tend to be highly satisfied with the airline services, while the majority of the customers in economy and economy plus seem to be less satisfied. This shows that the airline class is one of the factors that highly impact customer satisfaction.

“Fig 7: The relationship between airline class and customer satisfaction”


C.3- The graph below shows the relationship between the gender of the customer and his/her satisfaction with the airline services. As can be seen, both groups have almost the same percentages of satisfaction and dissatisfaction, which indicates that gender is not a significant factor and does not impact customer satisfaction.

“Fig 8: The relationship between gender and customer satisfaction”

C.4- The graph below shows the relationship between the in-flight wifi service and customer satisfaction. A strange pattern can be observed here: almost all the customers who rated the wifi service 0 were satisfied, which is abnormal. The normal behavior is what we see for the customers who gave ratings of 1, 2, and 3, where the majority were not satisfied, while the majority of the customers who rated the wifi service 4 or 5 were satisfied. So there should be a reason behind those satisfied customers who suffered from the wifi service. In any case, the wifi service, as shown, is one of the factors with the strongest effect on customer satisfaction.

“Fig 9: The relationship between wifi service in the flight and customer satisfaction”
C.5- The graph below follows almost the same pattern as the previous one. As we can see, the majority of the customers who rated the ease of online booking 0 tend to be satisfied, which is abnormal behavior, as shown earlier.

“Fig 10: The relationship between ease of online booking and customer satisfaction”

C.6- The graph below follows almost the same pattern as the previous one. As we can see, all the customers who rated the gate location 0 were 100% satisfied, which is abnormal behavior, as shown in the previous two graphs. Since this behavior occurs across several variables, we can intuitively guess that these customers did not read the survey carefully and simply selected the first option in order to complete it quickly. If this is the case, those customers could be genuinely misleading in both the EDA and the predictive analysis. Initially, however, we will leave them in the data until confirming with the data provider whether the surveyed customers were obliged to take the survey for some reason, which might have led some of them to complete it as quickly as possible without reading it. In any case, the gate location is still considered one of the most important factors that should be taken into consideration for customer satisfaction.
“Fig 11: The relationship between the gate location and customer satisfaction”

C.7- The following relationship is between the type of travel and customer satisfaction. As can be seen from the figure below, customers on business travel tend to be more satisfied than customers on personal travel, the majority of whom tend to be dissatisfied with the airline services.

“Fig 12: The relationship between the type of the travel and customer satisfaction”
C.8- In the relationship below, we find that the customers who had a good check-in experience tend to be more satisfied, while the majority of the customers who had trouble with the check-in service were not satisfied. This graph disproves the intuition we proposed about the graphs in Fig 9 to Fig 11, because, as we can see in Fig 13, all the customers who gave the check-in service a rating of 0 were not satisfied, which suggests that they did read the survey carefully before selecting their answers. A set of other variables follows the same, normal behavior, such as cleanliness, on-board service, seat comfort, and in-flight entertainment.

“Fig 13: The relationship between the check-in service and customer satisfaction”

C.9- The graph below examines the important relationship between flight distance and customer satisfaction. It shows that on short-distance flights (0-1000 km) customers tend to be dissatisfied, while on long-distance flights (2000-4000 km) customer satisfaction reaches its highest percentages.

“Fig 14: The relationship between the flight distance and customer satisfaction”

C.10- The graph below shows the relationship between the food and drink on the flight and customer satisfaction. Contrary to what was expected, food and drink is not one of the most influential factors for customer satisfaction, as the percentage of satisfied customers at each food-and-drink rating is similar, which indicates that this factor has little impact on customer satisfaction.

“Fig 15: The relationship between the food and drink in the flight and customer satisfaction”

C.11- The graph below studies how satisfied each group of customers is. As can be seen, loyal customers tend to be more satisfied than disloyal customers, which seems to be a logical relationship: loyal customers have remained loyal because they are mostly satisfied, unlike disloyal customers, who became disloyal because they are mostly dissatisfied with the airline service.

“Fig 16: The relationship between the customer type and customer satisfaction”

C.12- Now we move to another type of analysis, the correlation analysis. In the correlation analysis, the correlation coefficient is calculated between each pair of variables and indicates how strong the relationship between the two variables is. As a rule of thumb, a coefficient between 0.5 and 1 is called a strong positive correlation, and a coefficient between -1 and -0.5 is called a strong negative correlation. Looking at the graph below, we can see an almost perfect positive correlation between the departure delay in minutes and the arrival delay in minutes, which means the more delay in departure, the more delay in arrival; this seems logical, and later in the modeling we could use only one of them to reduce the dimensionality. We also find a strong correlation among the four variables seat comfort, food and drink, cleanliness, and in-flight entertainment, which also seems logical, as the four variables are related to each other: if the seat is comfortable and the food and drink are good, the customer will most likely be pleased. There is also a negative correlation between the departure/arrival time convenience and the gate location, and a negative correlation between age and customer type, which means the higher the age, the more likely the customer is to be loyal.

“Fig 17: The correlation matrix”
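
A minimal sketch of how the matrix in Fig 17 is produced, assuming all features have already been converted to numeric:

ccorr <- cor(training)                      # pairwise correlation coefficients
library(corrplot)
corrplot(ccorr, type = "upper", order = "hclust", tl.col = "black", tl.cex = 0.65)  # visualize the upper triangle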

D. Modeling
In this part, predictive modeling is utilized to improve customer satisfaction rates. Predictive modeling helps in predicting early, based on the historical data, whether a customer will be satisfied with the airline services or not. This would help in proposing a more personalized experience, especially for the customers with a higher probability of being dissatisfied. Two well-known machine learning models have been used in a comparative study to select one of them as the winning model for the testing data. The two models are:
D.1- The decision tree model: a simple model that utilizes a tree structure to reach a decision by splitting on the features across different levels of connected nodes until reaching the final level, which gives the final decision.
D.2- The random forest model: an ensemble model that uses a set of n decision trees to reach the final decision through a voting mechanism, in which each decision tree casts its decision and the most common decision becomes the final one.
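
A minimal sketch of fitting both models, assuming trainData is the 70% training split created in the next step; the `.` shorthand stands in for the explicit feature lists used in the appendix:

library(rpart)
library(randomForest)
tree <- rpart(satisfaction ~ ., data = trainData, method = 'class', minbucket = 25)  # single decision tree
rf   <- randomForest(satisfaction ~ ., data = trainData)                             # ensemble of trees combined by voting/averaging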

Two steps are done before applying the models (a minimal sketch of both follows the list):
1- Splitting the training data into training (70%) and validation (30%) subsets.
2- Asserting that the data is balanced by checking the distribution of the dependent variable (satisfaction).
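
A minimal sketch of these two steps, assuming the cleaned, dummy-coded data frame is called training:

library(caret)
index <- createDataPartition(training$satisfaction, p = 0.70, list = FALSE)  # 70% of rows for training
trainData <- training[index, ]
testData  <- training[-index, ]                                              # remaining 30% kept for validation
prop.table(table(training$satisfaction))                                     # check that the two classes are reasonably balanced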

Two versions were used for each model: one before computing the feature importance, and one after computing the feature importance and training the model with only the ten most informative features (sketched below). The results can be found in Table-1.
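
A minimal sketch of how the ten most informative features in Fig 18 are obtained, assuming rf is the random forest fitted on the full feature set (the full version is in the appendix):

imp <- as.data.frame(varImp(rf))                            # importance score of each feature
imp <- data.frame(overall = imp$Overall, names = rownames(imp))
head(imp[order(imp$overall, decreasing = TRUE), ], 10)      # the 10 features with the highest importance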

“Fig 18: The ten most informative features”

Model                                        Accuracy
Decision tree before feature engineering     0.88
Decision tree after feature engineering      0.88
Randomforest before feature engineering      0.96
Randomforest after feature engineering       0.94
“Table-1: Comparison between the models’ accuracy”

The default randomforest model generates great results, with 96% accuracy on the validation data and 94% accuracy on the final testing data, which are very good results.
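
A minimal sketch of the final evaluation on the held-out testing file, assuming it has been cleaned and dummy-coded with exactly the same steps as the training file (threshold and column names follow the appendix):

testing <- read.csv(file.choose(), header = TRUE)           # load the separate testing file
# ... apply the same imputation and dummy coding as for the training data ...
predictedRF <- predict(rf, testing)
y_pred <- ifelse(predictedRF > 0.5, 1, 0)                    # 0.5 threshold, as in the validation step
mean(y_pred == testing$satisfaction)                         # final testing accuracy (reported as 94%)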

Findings
In light of what we have seen in the methodology, we are now able to answer the research questions. First of all, there are some critical factors that impact customer satisfaction: online boarding, the in-flight wifi service, the class of the flight (economy, business, etc.), the type of travel (personal or business), the in-flight entertainment, the type of the customer (loyal or disloyal), the comfort of the seat, the customer's age, and the distance of the flight. Some of these factors the airline company cannot control, such as the customer's age, the distance of the flight, and the type of travel. But there are still controllable factors, such as the in-flight entertainment, online boarding, the wifi service, and the comfort of the seat. So, if the company gives more attention to these controllable factors, or even a subset of them, that would surely contribute to increasing customer satisfaction rates, which in turn would raise the percentage of loyal customers and increase the airline company's overall revenue.

The second part of this research was about improving the overall process. After identifying the factors that highly impact customer satisfaction, we needed to build a robust model that could predict early whether a customer will be satisfied, based on how he/she rates the quality of the services and on different characteristics such as age, travel distance, type of travel, and type of customer. We were able to build a robust and accurate model that detects customer satisfaction using the different factors mentioned earlier: the model correctly classified 94% of the customer satisfaction labels on the testing data and 96% on the validation data, which is very good accuracy and ready for the production phase. Applied to the testing data, the model predicted that 56% of the customers would be satisfied with the airline service, while 44% would end up unsatisfied, which is not a good result and tells us that work needs to be done to improve the customer journey. However, being able to predict the distribution of satisfied and unsatisfied customers early would help the company know how far it is from reaching an acceptable percentage of satisfied customers, and also which factors, if improved or fixed, could help raise the satisfaction rates.

Appendix
## -----------------------------------------------------------------------------------------------------------
#Reading the dataset
training <- read.csv(file.choose(), header = TRUE)
#'
## -----------------------------------------------------------------------------------------------------------
#showing a sample of the data
head(training)

## -----------------------------------------------------------------------------------------------------------
#Getting the unique values of each feature to check for abnormal values
sapply(training,function(y)unique(y))

#'
## -----------------------------------------------------------------------------------------------------------
#Checking nulls in each column in the data
sapply(training, function(x) sum(is.na(x)))

#'
## -----------------------------------------------------------------------------------------------------------
#Replacing null values with mean value for each feature in the data
#install.packages('imputeTS')
library(imputeTS)
training <- na_mean(training)
training

## -----------------------------------------------------------------------------------------------------------
#Checking nulls in each column in the data
sapply(training, function(x) sum(is.na(x)))

#'
#'
## -----------------------------------------------------------------------------------------------------------
#get the relationship between each variable and the dependent variable
library(ggplot2)
library(dplyr)
AgePlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Age,fill=satisfaction),position="fill",width=0.8)
AgePlot

## -----------------------------------------------------------------------------------------------------------
DepPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Departure.Delay.in.Minutes,fill=satisfaction),position="fill",width=0.8)
DepPlot

## -----------------------------------------------------------------------------------------------------------
GenderPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Gender,fill=satisfaction),position="fill",width=0.8)
GenderPlot

## -----------------------------------------------------------------------------------------------------------
classPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Class,fill=satisfaction),position="fill",width=0.8)
classPlot

## -----------------------------------------------------------------------------------------------------------
Ease.of.Online.bookingPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Ease.of.Online.booking,fill=satisfaction),position="fill",width=0.8)
Ease.of.Online.bookingPlot

## -----------------------------------------------------------------------------------------------------------
Online.boardingPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Online.boarding,fill=satisfaction),position="fill",width=0.8)
Online.boardingPlot

## -----------------------------------------------------------------------------------------------------------
CleanlinessPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Cleanliness,fill=satisfaction),position="fill",width=0.8)
CleanlinessPlot

## -----------------------------------------------------------------------------------------------------------
Seat.comfortPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Seat.comfort,fill=satisfaction),position="fill",width=0.8)
Seat.comfortPlot

## -----------------------------------------------------------------------------------------------------------
Customer.TypePlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Customer.Type,fill=satisfaction),position="fill",width=0.8)
Customer.TypePlot

## -----------------------------------------------------------------------------------------------------------
Flight.DistancePlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Flight.Distance,fill=satisfaction),position="fill",width=0.8)
Flight.DistancePlot

## -----------------------------------------------------------------------------------------------------------
On.board.servicePlot<-ggplot(training)+
  geom_bar(mapping=aes(x=On.board.service,fill=satisfaction),position="fill",width=0.8)
On.board.servicePlot

## -----------------------------------------------------------------------------------------------------------
Checkin.servicePlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Checkin.service,fill=satisfaction),position="fill",width=0.8)
Checkin.servicePlot

## -----------------------------------------------------------------------------------------------------------
Inflight.entertainmentPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Inflight.entertainment,fill=satisfaction),position="fill",width=0.8)
Inflight.entertainmentPlot

## -----------------------------------------------------------------------------------------------------------
Food.and.drinkPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Food.and.drink,fill=satisfaction),position="fill",width=0.8)
Food.and.drinkPlot

## -----------------------------------------------------------------------------------------------------------
Gate.locationPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Gate.location,fill=satisfaction),position="fill",width=0.8)
Gate.locationPlot

## -----------------------------------------------------------------------------------------------------------
Inflight.wifi.servicePlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Inflight.wifi.service,fill=satisfaction),position="fill",width=0.8)
Inflight.wifi.servicePlot

## -----------------------------------------------------------------------------------------------------------
Leg.room.servicePlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Leg.room.service,fill=satisfaction),position="fill",width=0.8)
Leg.room.servicePlot

## -----------------------------------------------------------------------------------------------------------
Inflight.servicePlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Inflight.service,fill=satisfaction),position="fill",width=0.8)
Inflight.servicePlot

## -----------------------------------------------------------------------------------------------------------
Arrival.Delay.in.MinutesPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Arrival.Delay.in.Minutes,fill=satisfaction),position="fill",width=0.8)
Arrival.Delay.in.MinutesPlot
## -----------------------------------------------------------------------------------------------------------
Type.of.TravelPlot<-ggplot(training)+
  geom_bar(mapping=aes(x=Type.of.Travel,fill=satisfaction),position="fill",width=0.8)
Type.of.TravelPlot

#'
## -----------------------------------------------------------------------------------------------------------
# converting categorical variables into dummy variables
training$satisfaction[training$satisfaction == 'neutral or dissatisfied'] <- 0
training$satisfaction[training$satisfaction == 'satisfied'] <- 1
training$satisfaction <- as.integer(training$satisfaction)

training$Gender[training$Gender == 'Female'] <- 0


training$Gender[training$Gender == 'Male'] <- 1
training$Gender <- as.integer(training$Gender)

training$Customer.Type[training$Customer.Type == 'Loyal Customer'] <- 1


training$Customer.Type[training$Customer.Type == 'disloyal Customer'] <- 0
training$Customer.Type <- as.integer(training$Customer.Type)

training$Type.of.Travel[training$Type.of.Travel == 'Personal Travel'] <- 0


training$Type.of.Travel[training$Type.of.Travel == 'Business travel'] <- 1
training$Type.of.Travel <- as.integer(training$Type.of.Travel)

training$Class[training$Class == 'Eco Plus'] <- 0


training$Class[training$Class == 'Business'] <- 2
training$Class[training$Class == 'Eco'] <- 1
training$Class <- as.integer(training$Class)

#'
## -----------------------------------------------------------------------------------------------------------
head(training)

#'
#'
## -----------------------------------------------------------------------------------------------------------
#Get the correlation between each pair of features
ccorr <- cor(training)
ccorr
#install.packages("corrplot")
library(corrplot)

#Visualizing correlation matrix


corrplot(ccorr, type = "upper", order = "hclust", tl.col = "black", tl.srt = 100,tl.cex = 0.65)

## -----------------------------------------------------------------------------------------------------------

#'
## -----------------------------------------------------------------------------------------------------------
#Splitting the data into training and validation subsets
library(caret)
#Creating the training (70%) and validation (30%) data sets
index <- createDataPartition(training$satisfaction, p=0.70, list=FALSE)
trainData <- training[index,]
testData <- training[-index,]

#'
## -----------------------------------------------------------------------------------------------------------
#Randomforest modeling
library(randomForest)
rf <- randomForest(satisfaction ~
                     Age + Gender + Customer.Type + Type.of.Travel + Class + Flight.Distance +
                     Inflight.wifi.service + Departure.Arrival.time.convenient + Gate.location +
                     Food.and.drink + Online.boarding + Seat.comfort + Inflight.entertainment + On.board.service +
                     Leg.room.service + Baggage.handling + Checkin.service + Inflight.service +
                     Cleanliness + Arrival.Delay.in.Minutes + Departure.Delay.in.Minutes,
                   data = trainData)
#Predicting on test data
predictedRF <- predict(rf, testData)

#'
#'
## -----------------------------------------------------------------------------------------------------------
#if the predicted value is higher than 0.5 then assign the row to the satisfied group (1), else to the dissatisfied group (0)
y_pred_num <- ifelse(predictedRF > 0.5, 1, 0)
#Converting the predicted variable to a factor
y_pred <- as.factor(y_pred_num)
#Evaluation
accuracy <- mean(y_pred==testData$satisfaction)
accuracy

## -----------------------------------------------------------------------------------------------------------
#Analyzing variable importance using the random forest's importance scores
imp <- as.data.frame(varImp(rf))
imp <- data.frame(overall = imp$Overall,
names = rownames(imp))
imp[order(imp$overall,decreasing = T),]

## -----------------------------------------------------------------------------------------------------------
#Randomforest modeling after feature engineering & tuning
library(randomForest)
rf <- randomForest(satisfaction ~
                     Age + Online.boarding + Inflight.wifi.service + Class + Type.of.Travel +
                     Inflight.entertainment + Customer.Type + Seat.comfort + Leg.room.service +
                     Flight.Distance,
                   data = trainData)
#Predicting on test data
predictedRF <- predict(rf, testData)

#if the predicted value is higher than 0.5 then assign the row to the satisfied group (1), else to the dissatisfied group (0)
y_pred_num <- ifelse(predictedRF > 0.5, 1, 0)
#Converting the predicted variable to a factor
y_pred <- as.factor(y_pred_num)
#Evaluation
accuracy <- mean(y_pred==testData$satisfaction)
accuracy

## -----------------------------------------------------------------------------------------------------------
#decision tree model
library(rpart)
tree <- rpart(satisfaction ~ Gender + Customer.Type + Age +
Type.of.Travel + Class + Flight.Distance + Inflight.wifi.service +
Departure.Arrival.time.convenient + Ease.of.Online.booking +
Gate.location + Food.and.drink + Online.boarding + Seat.comfort +
Inflight.entertainment + On.board.service + Leg.room.service +
Baggage.handling + Checkin.service + Inflight.service +
Cleanliness + Departure.Delay.in.Minutes + Arrival.Delay.in.Minutes ,
data = trainData, method = 'class', minbucket=25)
#Predicting on test data
predictedds <- predict(tree, testData)

#if the predicted probability of the satisfied class is higher than 0.5 then assign the row to the satisfied group (1), else to the dissatisfied group (0)
y_pred_num <- ifelse(predictedds[,2] > 0.5, 1, 0)
#Converting the predicted variable to a factor
y_pred <- as.factor(y_pred_num)
#Evaluation
accuracy <- mean(y_pred==testData$satisfaction)
accuracy

## -----------------------------------------------------------------------------------------------------------
#Analyzing variable importance using the random forest's importance scores
imp <- as.data.frame(varImp(rf))
imp <- data.frame(overall = imp$Overall,
names = rownames(imp))
imp[order(imp$overall,decreasing = T),]

## -----------------------------------------------------------------------------------------------------------
#decision tree model
library(rpart)
tree <- rpart(satisfaction ~
                Age + Online.boarding + Inflight.wifi.service + Class + Type.of.Travel +
                Inflight.entertainment + Customer.Type + Seat.comfort + Leg.room.service +
                Flight.Distance,
              data = trainData, method = 'class', minbucket = 25)
#Predicting on test data
predictedds <- predict(tree, testData)

#if the predicted probability of the satisfied class is higher than 0.5 then assign the row to the satisfied group (1), else to the dissatisfied group (0)
y_pred_num <- ifelse(predictedds[,2] > 0.5, 1, 0)
#Converting the predicted variable to a factor
y_pred <- as.factor(y_pred_num)
#Evaluation
accuracy <- mean(y_pred==testData$satisfaction)
accuracy

## -----------------------------------------------------------------------------------------------------------
#Reading the dataset
testing <- read.csv(file.choose(), header = TRUE)
#'
## -----------------------------------------------------------------------------------------------------------
head(testing)

#'
#'
#'
