
PROJECT REPORT

ON

Flight Fare Prediction Using Machine Learning

Submitted by:

Farhan Ahmad

Registration No : 11916199

Course Code: INT248

Under the Guidance of

Ms. Jaspreet Kaur

School of Computer Science & Engineering

Lovely Professional University, Phagwara

(OCT, 2022)
ACKNOWLEDGEMENT

I would like to express my gratitude to my University and to Ms. Jaspreet
Kaur for giving me the opportunity to work on this project, Flight Fare
Prediction Using Machine Learning. It required a great deal of study and
practice, and I learned many new things as a result. I am truly thankful
to them.

I would also like to thank my friends, who helped me whenever I got stuck
on a problem related to my course. I am grateful for their support, as
they have always had my back when I needed it.

I would also like to acknowledge the support and consideration of my
parents, who have always been there to help me choose the right thing and
oppose the wrong. Without them I could never have learned and become the
person I am now.

I have taken efforts in this project. However, it would not have been
possible without the kind support and help of many individuals and
organizations. I would like to extend my sincere thanks to all of them.
Flight Fare Prediction Using Machine Learning

Overview

In this article, we analyze a flight fare dataset using essential
exploratory data analysis techniques and then predict the price of a flight
from features such as the airline, the arrival time, the departure time,
the duration of the flight, the source, the destination and more.
About the dataset

1. Airline: The name of the airline, such as Indigo, Jet Airways, Air India,
and many more.

2. Date_of_Journey: The date on which the passenger’s journey starts.

3. Source: The place from which the passenger’s journey starts.

4. Destination: The place to which the passenger wants to travel.

5. Route: The route by which the passenger travels from the source to the
destination.

6. Arrival_Time: The time at which the passenger reaches the destination.

7. Duration: The total time the flight takes to complete its journey from
source to destination.

8. Total_Stops: The number of places at which the flight stops during the
whole journey.

9. Additional_Info: Information about food, the kind of food, and other
amenities.

10. Price: The price of the flight for the complete journey, including all
expenses before boarding.

The life cycle of a data science project is divided into four parts:

1. Exploratory Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Deployment

Exploratory Data Analysis

Now, let’s start with the machine learning task of predicting flight fares.
I will start by importing the libraries that we need for this task and then
import the train dataset.

1) Importing libraries

2) Importing the dataset

The first thing to do when tackling a data science problem is to understand
the dataset you are working with. Key observations and trends in the data
were noted, and the correlations between the variables and the output
‘Price’ were examined. For this you can use df.info(), df.head(), etc.

By observing the train dataset, we notice that:

The Route column contains a list of cities, which we will need to separate,
since the dataset contains many route combinations.

The Arrival_Time column has dates attached to some values, which we will
need to separate.

The Duration and Date_of_Journey columns are in string format and need to
be converted to integer type.

The Total_Stops column has the word ‘stops’ attached to the number of
stops, and the Dep_Time and Duration columns are also not in an appropriate
form, so we need to convert them to integers.
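
The conversions listed above can be sketched in plain Python. This is only a minimal illustration, not the code used in the project: the helper names are hypothetical, and the raw formats ('2h 50m', '2 stops', 'non-stop', '01:10 22 Mar', '24/03/2019', cities separated by '→') are assumptions about how the columns are stored.

```python
from datetime import datetime


def parse_duration_minutes(text):
    """Convert a string such as '2h 50m', '19h' or '45m' to total minutes."""
    minutes = 0
    for part in text.split():
        if part.endswith("h"):
            minutes += int(part[:-1]) * 60
        elif part.endswith("m"):
            minutes += int(part[:-1])
    return minutes


def parse_stops(text):
    """Map 'non-stop' to 0 and strings such as '2 stops' to 2."""
    text = text.strip().lower()
    if text == "non-stop":
        return 0
    return int(text.split()[0])


def split_route(text, sep="→"):
    """Split a route string into a list of city codes (separator assumed)."""
    return [city.strip() for city in text.split(sep)]


def parse_clock_time(text):
    """Return (hour, minute) and ignore any date attached after the time."""
    hour, minute = text.split()[0].split(":")
    return int(hour), int(minute)


def parse_journey_date(text):
    """Return (day, month) from a 'DD/MM/YYYY' string (format assumed)."""
    parsed = datetime.strptime(text, "%d/%m/%Y")
    return parsed.day, parsed.month


# One made-up row, used only to show the helpers in action.
print(parse_duration_minutes("2h 50m"))
print(parse_stops("2 stops"))
print(split_route("BLR → NAG → DEL"))
print(parse_clock_time("01:10 22 Mar"))
print(parse_journey_date("24/03/2019"))
```

With a pandas DataFrame, each helper could be applied to its column with Series.apply to produce the integer features.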

In the data analysis we will try to find out the following:

1. Missing values in the dataset.

2. All the numerical variables and their distribution.

3. Categorical variables.

4. Outliers.

5. Relationship between the independent features and the dependent feature
(Price).



There are no missing values in our train dataset, because every column has
10683 entries. As train_data.info() shows, all the independent columns have
the “object” datatype, so to use these columns in a model we have to
convert them into an appropriate form, which we will do in feature
engineering.
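
A check of this kind can be sketched with pandas as follows. The three rows are made up and stand in for the real train file, and one missing value is included on purpose so that the check has something to report.

```python
import pandas as pd

# Hypothetical sample rows; the real train dataset is not reproduced here.
train_data = pd.DataFrame(
    {
        "Airline": ["IndiGo", "Air India", "Jet Airways"],
        "Total_Stops": ["non-stop", "2 stops", None],
        "Price": [3897, 7662, 13882],
    }
)

# Number of missing values in each column.
missing_per_column = train_data.isnull().sum()
print(missing_per_column)

# Columns that are not numeric yet and will need feature engineering.
non_numeric_columns = train_data.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)

# One simple option when only a few rows are affected: drop them.
clean_data = train_data.dropna()
print(len(clean_data))
```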

Exploratory data analysis can be classified in two ways: each method is
either graphical or non-graphical, and each method is either univariate,
bivariate or multivariate.

Univariate Analysis

In univariate analysis, only one variable is examined at a time. The
objective is to describe the data and analyze the patterns present in it.
Here, in the train dataset, we explore each categorical column separately.

The bar graph is very convenient for comparing categories of data. It can
also track changes over time, and it is best suited to visualizing discrete
data.

• Jet Airways is the most preferred airline, with the highest row count,
followed by Indigo and Air India. Jet Airways Business is the costliest
airline.

• The counts for Vistara Premium economy, Trujet, Multiple carriers Premium
economy and Jet Airways Business are quite low.
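
Row counts and typical prices per airline, as summarized in the observations above, can be computed along these lines. The rows and prices below are invented for illustration and are not taken from the project's data.

```python
import pandas as pd

# Made-up sample; the airline names follow the report, the prices do not.
train_data = pd.DataFrame(
    {
        "Airline": [
            "Jet Airways", "Jet Airways", "Jet Airways",
            "IndiGo", "IndiGo",
            "Air India",
            "Jet Airways Business",
        ],
        "Price": [10000, 12000, 11000, 4000, 5000, 7000, 50000],
    }
)

# Rows per airline, most frequent first (the data behind a bar graph).
airline_counts = train_data["Airline"].value_counts()
print(airline_counts)

# Median price per airline, costliest first.
median_price = (
    train_data.groupby("Airline")["Price"].median().sort_values(ascending=False)
)
print(median_price)

# With matplotlib installed, airline_counts.plot(kind="bar") draws the chart.
```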

Feature Selection

Model Training:

We cannot know beforehand which model will perform best on this problem. We
used the Extra Trees Regressor and the Random Forest Regressor on the train
set. You can try any number of regression models and choose the one that
suits the problem best.

We drop the ‘Price’ column from the train dataset to form the independent
variables and keep ‘Price’ as the dependent variable, so that we can look
for correlation between the two. After cleaning the data, we can visualize
it and better understand the relationships between the variables. There are
many more visualizations that you can use to learn about your dataset, such
as scatterplots, histograms and boxplots.

Using sns.heatmap(), we can see that ‘Total_Stops’ is positively correlated
with ‘Price’: more stops likely mean a higher fuel cost, which raises the
price. Total_Stops is also highly correlated with Duration_hours, meaning
that as the number of stops increases, the duration of the flight also
increases.
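
The matrix that such a heatmap displays can be computed with pandas alone. The sketch below uses invented numbers in which stops, duration and price rise together, so the values only illustrate the mechanics and are not results from the project.

```python
import pandas as pd

# Invented numeric features after feature engineering.
df = pd.DataFrame(
    {
        "Total_Stops": [0, 0, 1, 1, 2, 2, 3],
        "Duration_hours": [2, 3, 7, 9, 15, 18, 25],
        "Price": [3500, 4000, 7000, 8200, 11000, 12500, 15000],
    }
)

# Pearson correlation between every pair of columns.
corr = df.corr()
print(corr.round(2))

# With seaborn installed, sns.heatmap(corr, annot=True) draws this matrix.
```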
Feature Selection
Fitting model using Random Forest
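
A minimal sketch of these two steps with scikit-learn is given here. The data is synthetic and the feature names are placeholders chosen for illustration, so the importances and the score say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: price depends on stops and duration, not on "Noise".
rng = np.random.default_rng(0)
n_rows = 300
stops = rng.integers(0, 4, size=n_rows)
duration = 2 + 6 * stops + rng.normal(0, 1, size=n_rows)
noise = rng.normal(0, 1, size=n_rows)

feature_names = ["Total_Stops", "Duration_hours", "Noise"]
X = np.column_stack([stops, duration, noise])
y = 3000 + 2500 * stops + 100 * duration + rng.normal(0, 300, size=n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature selection: rank the features by Extra Trees importance.
selector = ExtraTreesRegressor(n_estimators=100, random_state=42)
selector.fit(X_train, y_train)
importances = dict(zip(feature_names, selector.feature_importances_))
print(importances)

# Model fitting: train a random forest and score it on the held-out rows.
reg_rf = RandomForestRegressor(n_estimators=100, random_state=42)
reg_rf.fit(X_train, y_train)
predictions = reg_rf.predict(X_test)
r2 = reg_rf.score(X_test, y_test)
print(r2)
```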
Fitting parametric distributions

You can also use distplot() to fit a parametric distribution to a dataset
and visually evaluate how closely it corresponds to the observed data (note
that recent seaborn releases deprecate distplot() in favour of histplot()
and displot()). The plot of the residuals should be close to a Gaussian
curve, and the difference between ‘y_test’ (real value) and ‘predictions’
should be minimal. Here most of the residuals are close to 0, which
suggests that our model is generalizing well.
Plotting y_test vs predictions.

Scatter plots use dots to show the relationship between two variables. Here
the points lie nearly along a straight line, which indicates that the
predictions are close to the actual values.

Hyperparameter Tuning

We use RandomizedSearchCV to find the best hyperparameters. It performs a
random search over the parameters, using 5-fold cross-validation across 100
different combinations.
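
Such a search could look roughly like the sketch below. The parameter grid and the synthetic data are assumptions made for illustration, and n_iter is set to 5 so that the example runs quickly, whereas the search described above tries 100 combinations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the engineered features.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = 5000 + 2000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 100, size=150)

# Hypothetical search space; the report does not list the values it used.
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # kept small here; the report uses 100
    cv=5,  # 5-fold cross-validation
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
print(search.best_score_)
```

Because the scoring is the negated mean squared error, the best score is the one closest to zero.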

Checking accuracy of the model


Evaluating accuracy is an essential part of creating machine learning
models, because it describes how well the model performs in its
predictions. The MSE, MAE, and RMSE metrics are mainly used to evaluate the
prediction error rates and model performance in regression analysis.

• MAE (Mean Absolute Error) is the average of the absolute differences
between the original and predicted values over the data set.

• MSE (Mean Squared Error) is the average of the squared differences
between the original and predicted values over the data set.

• RMSE (Root Mean Squared Error) is the square root of the MSE.
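
The three metrics can be written out directly, which makes the definitions above concrete. The function names mirror common usage but are defined here in plain Python, and the four prices are made-up values chosen so the arithmetic is easy to follow.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Made-up actual and predicted prices; the errors are -200, 500, -500 and 0.
y_test = [4000, 7500, 12000, 5200]
predictions = [4200, 7000, 12500, 5200]

mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = root_mean_squared_error(y_test, predictions)

print("MAE:", mae)    # (200 + 500 + 500 + 0) / 4 = 300.0
print("MSE:", mse)    # (40000 + 250000 + 250000 + 0) / 4 = 135000.0
print("RMSE:", rmse)  # square root of 135000, about 367.42
```

RMSE is expressed in the same unit as the price, which makes it easier to interpret than MSE.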

Saving model
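
One common way to save a trained model is Python's pickle module. The sketch below assumes this approach; the tiny model, the data and the file name flight_rf.pkl are hypothetical and chosen only for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up training set: [Total_Stops, Duration_hours] -> Price.
X = np.array([[0, 2.0], [1, 7.5], [2, 15.0], [3, 24.0]])
y = np.array([3500.0, 7000.0, 11000.0, 15000.0])
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Write the fitted model to disk (a temporary folder is used here).
model_path = os.path.join(tempfile.mkdtemp(), "flight_rf.pkl")
with open(model_path, "wb") as file:
    pickle.dump(model, file)

# Load it back and confirm that it predicts the same values.
with open(model_path, "rb") as file:
    loaded_model = pickle.load(file)

same_predictions = bool(np.allclose(model.predict(X), loaded_model.predict(X)))
print(same_predictions)
```

A pickled model should only be loaded from a trusted source and with a compatible scikit-learn version.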
