
INSY-5377-001- WEB AND SOCIAL ANALYTICS PROJECT REPORT

AVANI TALLURI, NANDITHA, HAMPI, DHEERAJ JAKKUVA


University of Texas Arlington
Table of Contents

1. Introduction
2. Data Description
3. Data Variables
4. Research Questions
5. Methodology
5.1 Loading Kaggle Uber-Lyft dataset
5.2 Data cleaning and handling of missing data
5.3 Creating dummy variables for the categorical variables
6. Results and Discussion
6.1 Multiple Regression
6.2 Correlation Matrix
6.3 K-Means Clustering Analysis
7. Conclusions
8. References

Figure 1: Kaggle dataset format
Figure 2: Data variables selected for analysis
Figure 3: Data cleaning and handling the missing data
Figure 4: Dummy variables for all categorical variables
Figure 5: R-squared value for the model, computed in Python
Figure 6:
Figure 7:
Figure 8: Elbow curve
Figure 9: K-means clustering


1. Introduction

Like food and water, transportation plays a key role in our lives: without it we could not
travel to stores, offices, or schools. Because smartphones are now widely accessible, it is
easier to book a cab through a mobile app than to wait for one on the road, and the Uber
and Lyft apps are widely used for this purpose. We therefore focused our analysis on Uber
and Lyft. Uber is an international company operating in 69 countries and around 900 cities
worldwide, whereas Lyft operates in 644 cities in the US and 12 cities in Canada. In the US,
Lyft is the second-largest passenger company, with a market share of 31%. Although the two
services have similar features, their prices differ and vary with traffic, weather, and time.
Uber calls this variation "Surge price", while Lyft calls it "Prime Time". We analyzed Lyft
and Uber fares using a dataset from kaggle.com, whose main purpose is to model how the
price of a cab ride varies with the features provided.

2. Data Description

Our initial dataset was downloaded from Kaggle:

https://www.kaggle.com/code/hkhoi91/data-visualization-uber-vs-lyft-ultimate-battle/notebook

The dataset contains 693,072 records with 10 attributes. The data was collected over about
24 days, from 26 November 2018 to 19 December 2018, in the Boston area. It contains
regular trip details such as the type of cab, source, time_stamp, destination, distance,
surge charges, and type of service. The dataset contains null values, which we handled
during data cleaning. Only the unlabeled part of the data, cab_rides.csv, was used in our
analysis.

Before managing and analyzing data, the first step is to think about the purpose, that is,
the reasons for doing the analysis at all. We started by asking questions in brainstorming
sessions, such as Where? What? How? Who? Which?

Figure 1: Kaggle dataset with all the data variables


3. Data Variables

Distance: The total distance covered in a trip
Cab type: Whether the ride is Lyft or Uber
Timestamp: Booking time of the ride
Source: Pickup location of the ride
Destination: Drop location of the ride
Price: Total price incurred on the booking
Surge multiplier: Surge charges applied on the booking, if any
ID: Booking ID of each booking
Product ID: Booking ID generated by the website for every booking
Name: The service type selected by the user

4. Research Questions.

Our analysis focused on predicting the performance of Uber and Lyft cab services using the
variables price and distance. We set out to answer the following questions:

• Which type of cab service is preferred most often?
• Which product type is preferred most often?
• How do cab fares compare across the different products of Lyft and Uber?
• What is the price from certain locations for each cab service?
• What are the cheapest and costliest rides, and the longest and shortest rides, in the dataset?
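
Each of these questions maps to a simple aggregation. The following is a minimal sketch in pandas, not the code used for the report: the column names (cab_type, name, source, distance, price) are assumed from the variable list in Section 3, and the five rows are made-up values for illustration only.

```python
import pandas as pd

# Made-up rows standing in for cab_rides.csv (illustration only).
rides = pd.DataFrame({
    "cab_type": ["Lyft", "Uber", "Uber", "Lyft", "Uber"],
    "name": ["Shared", "UberX", "Black", "Lux", "UberX"],
    "source": ["Haymarket Square", "North End", "Fenway", "Back Bay", "North End"],
    "distance": [0.44, 1.08, 2.30, 3.10, 1.50],
    "price": [5.0, 8.5, 22.0, 27.5, 9.5],
})

# Which cab service / product type is booked most often?
bookings_per_service = rides["cab_type"].value_counts()
bookings_per_product = rides["name"].value_counts()

# Average fare per product, and per pickup location for each service.
mean_fare_by_product = rides.groupby(["cab_type", "name"])["price"].mean()
mean_fare_by_source = rides.groupby(["source", "cab_type"])["price"].mean()

# Cheapest / costliest and shortest / longest rides.
cheapest = rides.loc[rides["price"].idxmin()]
costliest = rides.loc[rides["price"].idxmax()]
shortest = rides.loc[rides["distance"].idxmin()]
longest = rides.loc[rides["distance"].idxmax()]

print(bookings_per_service)
print(mean_fare_by_product)
```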

5. Methodology
5.1 Loading Kaggle dataset

First, we load the Kaggle dataset into Python. This reads all the data in cab_rides.csv into
a simple table. Figure 2 shows the data variables considered for the project.

Figure 2: Data variables selected for analysis
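
The loading step can be sketched as follows, assuming pandas is used. The column names in the sample are an assumption based on the variable list in Section 3, and the three rows are made-up stand-ins so the example runs without the real file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for cab_rides.csv (made-up values, assumed column names).
SAMPLE_CSV = """distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
0.44,Lyft,1544952607890,North Station,Haymarket Square,5.0,1.0,a1,lyft_line,Shared
1.08,Uber,1543284023677,West End,North End,,1.0,b2,uber_x,UberX
2.30,Uber,1543366822198,Back Bay,Fenway,12.5,1.0,c3,uber_black,Black
"""

def load_rides(source):
    """Read ride records from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)

# With the real data this would be load_rides("cab_rides.csv").
rides = load_rides(io.StringIO(SAMPLE_CSV))

print(rides.shape)         # (rows, columns)
print(rides.isna().sum())  # null count per column
```

Checking the per-column null counts right after loading is what reveals the missing prices handled in the next step.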

5.2 Data cleaning and handling of missing data

The price variable has 55,095 null values. We handled them by replacing each missing
value with the mean price.
Figure 3: Data cleaning and handling the missing data
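
A minimal sketch of mean imputation, assuming pandas and a price column as described above. The five values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up prices with two missing entries.
rides = pd.DataFrame({"price": [5.0, np.nan, 10.0, np.nan, 15.0]})

n_missing = int(rides["price"].isna().sum())  # count nulls before filling
mean_price = rides["price"].mean()            # NaN values are skipped by default
rides["price"] = rides["price"].fillna(mean_price)

print(n_missing, mean_price)
print(rides["price"].tolist())
```

An alternative would be to drop the incomplete rows with `rides.dropna(subset=["price"])`. Since price is also the value being predicted later, the choice between the two can affect the fitted model.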

5.3 Creating dummy variables for categorical variables


Figure 4: Dummy variables for all categorical variables
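
Dummy (one-hot) encoding turns each category into a 0/1 column so that it can enter a regression. A minimal sketch, assuming pandas and the assumed column names cab_type and name, with made-up rows:

```python
import pandas as pd

# Made-up rows; cab_type and name are the categorical variables.
rides = pd.DataFrame({
    "cab_type": ["Lyft", "Uber", "Uber"],
    "name": ["Shared", "UberX", "Black"],
    "distance": [0.44, 1.08, 2.30],
})

# drop_first=True drops one level per variable to avoid perfectly
# collinear dummy columns in a model that includes an intercept.
encoded = pd.get_dummies(
    rides, columns=["cab_type", "name"], drop_first=True, dtype=int
)

print(encoded)
```

Whether the report dropped a reference level is not stated. Without `drop_first`, every category gets its own column.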

6. Results and Discussion

6.1 Multiple Regression


We chose to run a multiple regression on our data to predict the prices for Uber and Lyft.
Below is the regression output in Python for our dataset. The R-squared value for the
model we created is 0.9282, which indicates that about 93% of the variance of the
dependent variable (price) is explained by the independent variables.

Figure 5: R-squared value for the model, computed in Python.
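
To make the R-squared calculation concrete, here is a minimal sketch of a multiple regression fitted by ordinary least squares with NumPy. It is not the model from the report: the data is synthetic, and the predictors (distance and a surge multiplier) and their coefficients are assumptions chosen for illustration. R-squared is computed as 1 - SS_res / SS_tot.

```python
import numpy as np

def design_matrix(X):
    """Prepend a column of ones so the model has an intercept."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    """Ordinary least squares; returns coefficients with the intercept first."""
    coef, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)
    return coef

def r_squared(X, y, coef):
    """Share of the variance of y explained by the fitted model."""
    residuals = y - design_matrix(X) @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Synthetic rides: price depends on distance and surge, plus noise.
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 5.0, size=200)
surge = rng.choice([1.0, 1.25, 1.5], size=200)
price = 3.0 + 2.5 * distance + 4.0 * surge + rng.normal(0.0, 0.5, size=200)

X = np.column_stack([distance, surge])
coef = fit_ols(X, price)
r2 = r_squared(X, price, coef)

print(coef)
print(r2)
```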

6.2 Correlation Matrix


We computed the correlations between the variables in our dataset to check whether any
variable is affected by another. Below are some screenshots of the outputs obtained. From
the results, we concluded that the variables are independent of one another.

Figure 6:
Figure 7:

R-squared is a statistical measure of fit that indicates how much of the variation of a
dependent variable is explained by the independent variable(s) in a regression model. It
measures the strength of the relationship between the model and the dependent variable
on a convenient 0 to 1 scale. For the model we created, the R-squared value is 0.9282,
which indicates that the predicted values are close to the actual observed values. A
correlation matrix shows the strength of the linear relationship between each pair of
variables. As per the correlation matrix obtained, the variables in the dataset appear to be
independent of one another.
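
A correlation matrix of the numeric variables can be produced as in the following minimal sketch, assuming pandas. The column names are assumed and the five rows are made up, so the numbers do not reproduce the report's figures.

```python
import pandas as pd

# Made-up numeric columns for illustration.
rides = pd.DataFrame({
    "distance": [0.5, 1.0, 2.0, 3.0, 4.0],
    "surge_multiplier": [1.0, 1.0, 1.25, 1.0, 1.5],
    "price": [5.0, 7.0, 12.0, 13.5, 22.0],
})

# Pearson correlation between every pair of columns; values lie in [-1, 1].
corr = rides.corr()

print(corr.round(3))
```

Entries near 0 suggest little linear association between two variables, while entries near 1 or -1 indicate a strong linear relationship. The diagonal is always 1.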

6.3 K-Means Clustering Model Analysis


Clustering is the process of dividing a dataset into groups of similar data points. It is a type
of unsupervised machine learning, used when the data is unlabeled. Here, we applied the
K-means clustering algorithm, whose main goal is to group similar data points into a
cluster. We performed the K-means clustering analysis on price and distance because they
are highly correlated, and this gave better results. The elbow curve obtained from the
dataset suggests 4 clusters. These are some of the outputs obtained from the Python code.

Figure 8: Elbow curve


Figure 9: K-means clustering
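
The elbow method can be sketched as follows. The report does not state which library was used, so this is a plain NumPy version of Lloyd's algorithm with a single random initialization, run on synthetic (distance, price) points drawn around four made-up centers. The inertia (within-cluster sum of squared distances) is computed for each k, and the "elbow" is the k after which it stops dropping sharply.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = float(d2.min(axis=1).sum())
    return centroids, labels, inertia

# Synthetic (distance, price) points around four made-up centers.
rng = np.random.default_rng(42)
centers = np.array([[1.0, 8.0], [2.0, 15.0], [3.5, 25.0], [5.0, 40.0]])
points = np.vstack([c + rng.normal(0.0, [0.2, 1.0], size=(50, 2)) for c in centers])

# Inertia for k = 1..8; plotting these values against k gives the elbow curve.
inertias = {k: kmeans(points, k)[2] for k in range(1, 9)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Because price and distance are on different scales, the distance computation here is dominated by price. Standardizing both columns before clustering is a common choice, though the report does not say whether this was done.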

7. Conclusions
We analyzed cab fares for Lyft and Uber over the period from 26 November 2018 to
19 December 2018. Uber received more bookings than Lyft in this period, which we
attribute to the difference in price. Fares increase with the distance or duration of the
trip. Comparatively, all Uber services show lower prices than Lyft in almost all locations
and at all timestamps.

Uber has lower surge charges than Lyft, which helped customers choose Uber because its
total price is lower. The R-squared value of 0.9282 in the multiple regression shows that
the predicted values are close to the actual values. Also, according to the correlation
matrix, the variables in the dataset are independent of one another.

8. Acknowledgements and References

(1) Dataset: https://www.kaggle.com/code/hkhoi91/data-visualization-uber-vs-lyft-ultimate-battle/notebook
