Web and Social Group 9 Report

INSY-5377-001- WEB AND SOCIAL ANALYTICS PROJECT REPORT
AVANI TALLURI, NANDITHA, HAMPI, DHEERAJ JAKKUVA

University of Texas Arlington
Table of Contents
1. Introduction
2. Data Description
3. Data Variables
4. Research Questions
5. Methodology
5.1 Loading Kaggle Uber-Lyft Data Set
5.2 Data Cleaning and Handling of Missing Data
5.3 Creating Dummy Variables for the Categorical Variables
6. Results and Discussion
6.1 Multiple Regression
6.2 Correlation Matrix
6.3 K-Means Clustering Analysis
7. Conclusions
8. References
Figure 1: Kaggle dataset format
Figure 2: Data variables selected for analysis
Figure 3: Data cleaning and handling the missing data
Figure 4: Dummy variables for all categorical variables
Figure 5: R-squared value for the model using Python
Figure 6:
Figure 7:
Figure 8: Elbow curve
Figure 9: K-means clustering

1. Introduction
Like food and water, transportation plays a key role in our lives: without it we could not
travel to stores, offices, or schools. Because smartphones are now widely accessible, it is
easier to book a cab through a mobile app than to wait for one on the road, and Uber and Lyft
are two of the most widely used apps for doing so. We therefore focused our analysis on Uber
and Lyft. In terms of market landscape, Uber is an international company operating in 69
countries and around 900 cities worldwide, whereas Lyft operates in 644 cities in the US and
12 cities in Canada. In the US, however, Lyft is the second-largest ride-hailing company, with
a market share of 31%. Although the two services have similar features, they differ in price,
which varies with traffic, weather, and time. Uber calls this variation "Surge price" and Lyft
calls it "Prime Time". We analyzed Lyft and Uber fares using a dataset from kaggle.com; the
main purpose of this large dataset is to model how cab price varies with the features
provided.
2. Data Description
Our initial dataset was downloaded from Kaggle:
https://www.kaggle.com/code/hkhoi91/data-visualization-uber-vs-lyft-ultimate-battle/notebook
The dataset contains 693072 records and 10 variables. The data was collected over about 24
days, from 26th November 2018 to 19th December 2018, in the Boston area.
It contains regular trip details such as the type of cab, source, time_stamp, destination,
distance, surge charges, and type of service. The dataset has null values, which we handled
during data cleaning. Only the unlabeled part of the data, cab_rides.csv, was used in our
analysis.
Before managing and analyzing data, the first step is to think about the purpose, that is,
the reasons for doing the analysis. We started by asking questions in brainstorming sessions,
such as Where? What? How? Who? Which?
Figure 1: Kaggle dataset with all the data variables

3. Data Variables
Distance: Total distance covered in the trip
Cab type: Whether the ride is Lyft or Uber
Timestamp: Booking time of the ride
Source: Pickup location of the ride
Destination: Drop-off location of the ride
Price: Total price incurred on the booking
Surge multiplier: Surge multiplier applied to the booking, if any
id: Booking ID of each booking
productid: Booking ID generated by the website for every booking
Name: Service type selected by the user
4. Research Questions
Our analysis focused on assessing the performance of Uber and Lyft cab services using the
variables price and distance. We addressed the following questions:
• Which type of cab service is preferred most often?
• Which product type is preferred most often?
• How do cab fares compare across the different products of Lyft and Uber?
• What is the price from certain locations for each cab service?
• Which rides are the cheapest, costliest, longest, and shortest in the whole dataset?
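The descriptive questions above reduce to counting and min/max lookups. The following is a minimal sketch in plain Python, assuming each ride is a record with cab_type, name, price, and distance fields as described in Section 3. The sample rows are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample rows; values are illustrative, not from the dataset.
rides = [
    {"cab_type": "Uber", "name": "UberX", "price": 9.5, "distance": 1.2},
    {"cab_type": "Lyft", "name": "Lyft", "price": 11.0, "distance": 1.4},
    {"cab_type": "Uber", "name": "UberPool", "price": 7.0, "distance": 0.9},
    {"cab_type": "Lyft", "name": "Lux", "price": 22.5, "distance": 3.1},
    {"cab_type": "Uber", "name": "UberX", "price": 12.0, "distance": 2.0},
]

# Which cab service and which product appear most often?
top_cab = Counter(r["cab_type"] for r in rides).most_common(1)[0]
top_product = Counter(r["name"] for r in rides).most_common(1)[0]

# Cheapest / costliest ride and shortest / longest ride.
cheapest = min(rides, key=lambda r: r["price"])
costliest = max(rides, key=lambda r: r["price"])
shortest = min(rides, key=lambda r: r["distance"])
longest = max(rides, key=lambda r: r["distance"])

print(top_cab, top_product)
print(cheapest["price"], costliest["price"], shortest["distance"], longest["distance"])
```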
5. Methodology
5.1 Loading Kaggle Dataset
First, we loaded the Kaggle dataset into Python. This reads all the data in cab_rides.csv
into a simple table. Figure 2 shows the data variables considered for the project.
Figure 2: Data variables selected for analysis
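The report's loading code appears only as a screenshot, so the following is a sketch of one way to read such a file using only the Python standard library. The column names are assumptions based on the variables listed in Section 3, and the two sample rows (including the product_id values) are made up so the example runs without the real file.

```python
import csv
import io

# Stand-in for cab_rides.csv: a two-row in-memory sample with made-up values.
# Column names are assumed from the variable list in Section 3.
SAMPLE = """distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
0.44,Lyft,1544952607890,North Station,Haymarket Square,5.0,1.0,a1,lyft_line,Shared
1.10,Uber,1543284023677,West End,North End,,1.0,b2,uber_x,UberX
"""

def load_rides(handle):
    """Read ride rows from an open CSV file handle into a list of dicts."""
    return list(csv.DictReader(handle))

# With the real file one would pass: open("cab_rides.csv", newline="")
rides = load_rides(io.StringIO(SAMPLE))
print(len(rides), list(rides[0].keys()))
```

Every value comes back as a string, and a missing price comes back as an empty string, so numeric columns need an explicit conversion before any analysis.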
5.2 Data Cleaning and Handling of Missing Data
The price variable has 55095 null values, which we handled by replacing them with the mean
price.
Figure 3: Data cleaning and handling the missing data
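Mean imputation as described above can be sketched as follows. This is an illustration in plain Python with made-up prices, where None stands for a null value; the helper name fill_missing_with_mean is hypothetical and is not the code used in the project.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Return a copy of `values` with each None replaced by the mean
    of the non-missing entries."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute: every value is missing")
    fill = mean(present)
    return [fill if v is None else v for v in values]

prices = [10.0, None, 14.0, None, 12.0]  # illustrative values only
filled = fill_missing_with_mean(prices)
print(filled)
```

Mean imputation keeps the column mean unchanged but reduces its variance, which is worth keeping in mind when the imputed column is later used as the regression target.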
5.3 Creating Dummy Variables for Categorical Variables

Figure 4: Dummy variables for all categorical variables
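Dummy (indicator) variables turn each category of a categorical column into a 0/1 column so that it can enter a regression. The sketch below shows the idea in plain Python on made-up rows. The helper one_hot is hypothetical, and dropping the first category as a baseline is an assumption, since the report does not say whether a baseline was dropped.

```python
def one_hot(rows, column):
    """Replace `column` with 0/1 indicator keys, one per category,
    dropping the first category (in sorted order) as the baseline."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories[1:]:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

rides = [  # illustrative values only
    {"cab_type": "Uber", "distance": 1.2},
    {"cab_type": "Lyft", "distance": 0.8},
]
encoded = one_hot(rides, "cab_type")
print(encoded)
```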
6. Results and Discussion
6.1 Multiple Regression

We chose to run multiple regression on our data to predict the prices for Uber and Lyft.
Below is the regression output in Python for our dataset. The R-squared value for the model
is 0.9282, which indicates that about 93% of the variance of the dependent variable (price)
is explained by the independent variables.
Figure 5: R-squared value for the model using Python
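To make the R-squared calculation concrete, the sketch below fits a least-squares line and computes R-squared as 1 - SS_res / SS_tot. For brevity it uses a single predictor (distance) rather than the multiple regression reported above, and the data points are made up, so the resulting value is not the report's 0.9282.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(y, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

distance = [1.0, 2.0, 3.0, 4.0]  # illustrative values only
price = [6.0, 9.0, 10.0, 15.0]
intercept, slope = fit_line(distance, price)
predicted = [intercept + slope * d for d in distance]
r2 = r_squared(price, predicted)
print(round(r2, 4))
```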
6.2 Correlation Matrix

We computed the correlation between the variables in our dataset to check whether any
variable is affected by another. Below are some screenshots of the outputs obtained. From
the results, we concluded that all the variables can be treated as independent variables.
Figure 6:
Figure 7:
R-squared is a statistical measure of fit that indicates how much of the variation of a
dependent variable is explained by the independent variable(s) in a regression model. It
measures the strength of the relationship between the model and the dependent variable on a
convenient 0 to 1 scale. For our model the R-squared value is 0.9282, which indicates that
the predicted values are close to the actual observed values. A correlation matrix shows the
strength of the linear relationship between each pair of variables. As per the correlation
matrix obtained, all the variables in the dataset can be treated as independent variables.
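A correlation matrix holds the Pearson correlation coefficient for every pair of numeric variables. The following sketch computes one in plain Python. The column names follow Section 3, the values are made up, and the helper names pearson and correlation_matrix are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

data = {  # illustrative values only
    "distance": [1.0, 2.0, 3.0, 4.0],
    "price": [6.0, 9.0, 10.0, 15.0],
    "surge_multiplier": [1.0, 1.0, 1.5, 1.0],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

Values near +1 or -1 indicate a strong linear relationship between two variables, while values near 0 indicate a weak one.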
6.3 K-Means Clustering Analysis

Clustering is the process of dividing a dataset into groups of similar data points. It is a
type of unsupervised machine learning, used when the data is unlabeled. Here, we applied the
K-means clustering algorithm, whose main goal is to group similar data points into clusters.
We performed the K-means clustering analysis on price and distance because they are highly
correlated, and this gave better results. The elbow curve obtained from the dataset indicated
4 clusters for K-means. These are some of the outputs obtained from the Python code.
Figure 8: Elbow curve

Figure 9: K-means clustering
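The elbow method runs K-means for several values of k and looks for the point where the within-cluster sum of squares (inertia) stops dropping sharply. Below is a minimal sketch of Lloyd's algorithm on (distance, price) pairs in plain Python. The points are made up and form two obvious groups, so this sketch does not reproduce the 4-cluster result reported above; the function names and the fixed random seed are assumptions for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia

# Illustrative (distance, price) pairs only.
points = [(1.0, 6.0), (1.2, 7.0), (0.9, 5.5), (3.0, 20.0), (3.2, 22.0), (2.9, 19.0)]

# Inertia for each k; the "elbow" is where the drop levels off.
inertias = {k: kmeans(points, k)[1] for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

Because price and distance are on different scales, price dominates the distance calculation here; standardizing both columns first is a common choice, though the report does not state whether that was done.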
7. Conclusions
We analyzed cab fares for Lyft and Uber for the period from 26th November 2018 to 19th
December 2018. Uber received more bookings than Lyft in this period, due to the variations
in price. The greater the distance or duration of a cab trip, the higher the fare.
Comparatively, Uber services show lower prices than Lyft in almost all locations and at all
time stamps.
Uber has lower surge charges than Lyft, which helped customers choose Uber, as its total
price is lower. The R-squared value of 0.9282 in the multiple regression shows that the
predicted values are close to the actual values. Also, according to the correlation matrix,
all the variables in the dataset can be treated as independent variables.
8. Acknowledgements and References
(1) Dataset: https://www.kaggle.com/code/hkhoi91/data-visualization-uber-vs-lyft-ultimate-battle/notebook
