1. Introduction
2. Data Description
3. Data Variables
4. Research Questions
5. Methodology
7. Conclusions
8. References
Figure 1: Kaggle dataset format
1. Introduction
Like food and water, transportation plays a key role in our lives: without it we could
not travel to stores, offices, schools, and so on. Because smartphones are now easily
accessible, it is easier to book a cab through a mobile app than to wait for one on the
road. The Uber and Lyft apps are widely used for booking cabs, so we based our analysis on
Uber and Lyft. In terms of landscape, Uber is an international company operating in 69
countries and around 900 cities worldwide, whereas Lyft operates in 644 cities in the US and 12
market share of 31%. Although the two services have similar features, their prices differ
and vary with traffic, weather, and time. Uber calls this variation "surge pricing" and
Lyft calls it "Prime Time". We analyzed Lyft and Uber fares using a dataset from
kaggle.com; the main objective of this large dataset is to model how the price of a cab
varies with the given features.
2. Data Description
https://www.kaggle.com/code/hkhoi91/data-visualization-uber-vs-lyft-ultimate-battle/notebook
This dataset contains 693,072 records with 10 attributes. The data was collected over about
24 days, from 26 November 2018 to 19 December 2018, in the Boston area.
It contains regular trip details such as the type of cab, source, time_stamp,
destination, distance, surge charges, and type of service. The dataset contains null values,
which we handled during data cleaning. Only the unlabeled part of data cab_rides.csv was
Before we start managing and analyzing data, the first thing to do is think about the
purpose, that is, the reasons why we are going to
4. Research Questions
Our analysis focused on predicting the performance of Uber and Lyft cab services using the
variables price and distance. Below are the questions we answered:
5. Methodology
5.1 Loading Kaggle dataset
First, we load the Kaggle dataset into Python. This reads all the data in cab_rides.csv
into a simple table. Figure 2 shows the data variables considered for the project.
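The loading step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the project. The column names and the three sample rows are assumptions made for the example; for the real file, pass an open handle to cab_rides.csv instead of the in-memory sample.

```python
import csv
import io

# Hypothetical three-row excerpt standing in for cab_rides.csv.
# Column names are assumed for illustration.
SAMPLE_CSV = """distance,cab_type,time_stamp,destination,source,price,surge_multiplier,name
0.44,Lyft,1544952607890,North Station,Haymarket Square,5.0,1.0,Shared
1.08,Uber,1543284023677,West End,North End,,1.0,Taxi
2.30,Uber,1543366822198,Back Bay,Fenway,13.5,1.0,UberX
"""

def load_rides(handle):
    """Read ride rows from an open CSV handle into a list of dicts.

    Numeric fields are converted to float; an empty price becomes None.
    """
    rides = []
    for row in csv.DictReader(handle):
        row["distance"] = float(row["distance"])
        row["surge_multiplier"] = float(row["surge_multiplier"])
        row["price"] = float(row["price"]) if row["price"] else None
        rides.append(row)
    return rides

rides = load_rides(io.StringIO(SAMPLE_CSV))
# For the real file:
#   with open("cab_rides.csv", newline="") as f:
#       rides = load_rides(f)
print(len(rides), "rows; columns:", list(rides[0]))
```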
The price variable has 55095 null values; we handled them by replacing each with the
mean price.
Figure 3: Data cleaning and handling the missing data
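Mean imputation as described above can be sketched like this. The function name and the sample prices are hypothetical, and missing values are assumed to be represented as None.

```python
from statistics import fmean

def fill_missing_with_mean(values):
    """Return a copy of values with each None replaced by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    mean = fmean(observed)
    return [mean if v is None else v for v in values]

prices = [5.0, None, 13.5, 9.0, None]  # made-up prices with two missing entries
filled = fill_missing_with_mean(prices)
print(filled)
```

Mean imputation leaves the average price unchanged but reduces its variance, so dropping the rows with a missing price is a common alternative when price is the variable being predicted.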
Lyft. Below is the regression output in Python for our dataset. The R-squared value for the
model we created is 0.9282, which indicates that about 92.8% of the variance of the
dependent variable is explained by the independent variables.
Figure 5: R² value for the model, obtained using Python.
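To make the R-squared calculation concrete, here is a small sketch that fits a least-squares line and computes R² = 1 − SS_res / SS_tot. It uses a single predictor (distance) for brevity, whereas the report's model is a multiple regression. The data values are made up and will not reproduce the 0.9282 reported above.

```python
from statistics import fmean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(y, y_pred):
    """Share of the variance in y explained by the predictions: 1 - SS_res / SS_tot."""
    my = fmean(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

distance = [0.5, 1.0, 2.0, 3.0, 4.5]    # made-up distances
price = [6.0, 8.5, 12.0, 17.5, 24.0]    # made-up fares

intercept, slope = fit_line(distance, price)
predicted = [intercept + slope * d for d in distance]
r2 = r_squared(price, predicted)
print("R-squared:", round(r2, 4))
```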
affected by any other variable. Below are some screenshots of the outputs obtained. From
the results, we concluded that all the variables are independent variables.
Figure 6:
Figure 7:
R-Squared is a statistical measure of fit that indicates how much variation of a dependent
the strength of the relationship between the model and the dependent variable on a
convenient 0–1 scale. For the model we created, the R-squared value is 0.9282, which
indicates that the predicted values are close to the actual observed values. A correlation
matrix shows the strength of the linear relationship between each pair of variables. Based
on the correlation matrix obtained, we concluded that all the variables in the dataset are
independent variables.
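A correlation matrix of the kind described above can be sketched as follows, using the Pearson coefficient for every pair of numeric columns. The column names and values are assumptions for illustration, not figures from the dataset.

```python
from math import sqrt
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Nested dict of pairwise Pearson correlations, keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up values; the column names are assumed.
columns = {
    "distance": [0.5, 1.0, 2.0, 3.0, 4.5],
    "price": [6.0, 8.5, 12.0, 17.5, 24.0],
    "surge_multiplier": [1.0, 1.0, 1.25, 1.0, 1.5],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values near +1 or −1 indicate a strong linear relationship between two columns, and values near 0 indicate a weak one.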
points. Clustering is a type of unsupervised machine learning used when the data is
unlabeled. Here we applied the K-means clustering algorithm, whose goal is to group
similar data points into clusters. We ran K-means on price and distance because they are
highly correlated, and this gave better results. The elbow curve for the dataset
suggested 4 clusters. These are some of the outputs obtained from the Python code.
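The K-means and elbow-curve steps can be sketched as below. This is a plain-Python version of Lloyd's algorithm with a deterministic farthest-point initialisation, chosen so the example is reproducible; it is not the project's code. The (distance, price) points are made up and deliberately form four tight groups.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    # Farthest-point initialisation: start from the first point, then keep
    # adding the point that is farthest from its nearest chosen centroid.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    # Inertia: sum of squared distances from each point to its centroid.
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia

# Made-up (distance, price) pairs forming four groups.
points = [(0.5, 5.0), (0.7, 6.0), (0.6, 5.5),
          (2.0, 12.0), (2.2, 13.0), (2.1, 12.5),
          (3.5, 22.0), (3.7, 23.0), (3.6, 22.5),
          (5.0, 35.0), (5.2, 36.0), (5.1, 35.5)]

# Elbow curve: inertia for k = 1..6; look for the k where the drop levels off.
for k in range(1, 7):
    print(k, round(kmeans(points, k)[2], 2))

labels4, centroids4, inertia4 = kmeans(points, 4)
```

The features are not scaled here, so price, which has the larger numeric range, dominates the distance calculation; standardising both columns first is a common refinement.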
7. Conclusions
We analyzed cab fares for Lyft and Uber for the period from 26 November 2018 to
19 December 2018. Uber received more bookings than Lyft in this period, which we
attribute to the variations in price. The greater the distance or duration of a ride,
the higher the fare. Comparatively, all Uber services show lower prices than
Uber has lower surge charges than Lyft, which helped customers choose Uber because
its total price is lower. The R-squared value of the multiple regression is 0.9282, which
shows that the predicted values are close to the actual values. Also, according to the correlation
8. References
(1) Dataset: https://www.kaggle.com/code/hkhoi91/data-visualization-uber-vs-lyft-ultimate-battle/notebook
(2)