
Report On

“Uber Data Analysis”

By
Sanjay Shamnani (D17B - 57)
Rahul Tejwani (D17B - 68)
Avi Watwani (D17B - 75)

Department of Computer Engineering

Vivekanand Education Society’s Institute of Technology

2020-21
Table Of Contents

1. Problem Definition and Scope of Project

1.1 Introduction
1.2 Problem Definition and Scope of Project
1.3 Users of the system
1.4 Dataset

2. Literature Review

3. Conceptual System Design

3.1: Conceptual System Design - CSD Diagram with explanation of each module
3.2: Methodology
3.2.1: Data Gathering / Loading
3.2.2: Data Preprocessing, Descriptive Analysis
3.2.3: Filtering
3.2.4: Classification / Clustering
3.2.5: Visualizations

4. Technology Used

5. Implementation

6. Results & Conclusion

7. References
1. Problem Definition and Scope of the Project

1.1 Introduction

Uber Technologies operates a peer-to-peer (P2P) network for sharing travel: the Uber platform connects you with drivers who can take you to your destination. This dataset includes primary data on Uber pickups, with details that include the date and time of travel as well as the pickup longitude and latitude. To this dataset we applied the K-means clustering algorithm. Clustering is the method of grouping objects into groups based on similarities; K-means divides a given data set into k groups, where k represents the number of groups and must be provided by the user. In the case of the Uber dataset we already know k: it is 5, the number of boroughs of New York City. K-means is a good algorithm choice for the Uber 2014 data points. We applied K-means clustering to New York City's five boroughs to better analyze the data set and identify the different boroughs within New York.
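As an illustration, a minimal K-means sketch using scikit-learn is shown below. It assumes pickup coordinates in columns named Lat and Lon (as in the publicly available Uber 2014 pickup files); the file name and column names are assumptions made for illustration, not a description of our exact setup.

```python
# Minimal K-means sketch: group pickups into 5 clusters, one per NYC borough.
# The file name and the 'Lat'/'Lon' column names are assumed for illustration.
import pandas as pd
from sklearn.cluster import KMeans

pickups = pd.read_csv("uber-raw-data-2014.csv")            # hypothetical file name
coords = pickups[["Lat", "Lon"]].values

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)  # k = 5 boroughs
pickups["cluster"] = kmeans.fit_predict(coords)

print(kmeans.cluster_centers_)            # approximate centre of each cluster
print(pickups["cluster"].value_counts())  # pickups per cluster
```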

1.2 Problem definition and scope

This project aims to visualize Uber's ridership growth in NYC over the period covered by the dataset and to characterize the demand based on patterns identified in the time series. We also aim to estimate the value of the NYC market for Uber and its revenue growth, to derive other insights about usage of the service, and to attempt to predict the growth in demand beyond 2015.

One of Uber’s biggest uses of data (and likely the one that draws the greatest ire from
passengers) comes in the form of surge pricing, a model nicknamed “Geosurge” at Uber.
If you’re running late to an appointment and you need to book a ride in a crowded
downtown space, be prepared to pay almost twice as much for it.

1.3 Users of the system

Data analysts and members of Uber's research and development team will be the users of this system, using it to gain a better understanding of their customers and to make sales predictions for the coming year.

1.4 Dataset

The data comprises one complete year of trips, with a total of about 31 million entries. The uncompressed file is 1.4 GB, which is still manageable on a laptop with 16 GB of RAM. However, some objects will be large enough to require careful reasoning about how to apply transformations to them efficiently, from date-time parsing to arithmetic functions.
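As a rough sketch of the memory-conscious loading this implies, the snippet below reads the file with explicit dtypes and parses timestamps with a fixed format; the file name and column names are assumptions for illustration.

```python
# Sketch of loading a ~1.4 GB trip CSV on a 16 GB laptop without exhausting memory.
# File name and column names are assumed for illustration.
import pandas as pd

dtypes = {
    "origin_taz": "category",       # anonymized origin zone
    "destination_taz": "category",  # anonymized destination zone
    "trip_distance": "float32",
    "trip_duration": "float32",
}

trips = pd.read_csv(
    "uber-trips-2015.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["pickup_datetime"],
)

# Parsing dates after loading, with an explicit format, is usually much faster
# than letting read_csv infer the format row by row.
trips["pickup_datetime"] = pd.to_datetime(
    trips["pickup_datetime"], format="%Y-%m-%d %H:%M:%S"
)

trips.info(memory_usage="deep")
```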
2. Literature Review

Based on our review of the literature, this section discusses the growing demand for rides in cities and how the data analysis was done. Several researchers have discussed how Uber has adopted dynamic pricing, which works on the demand-and-supply principle of economics: the higher the demand, the higher the price. This is beneficial for both the company and the drivers; in peak hours as well as at night, drivers earn a good amount of money. Because of the novelty of the industry, the depth of literature in the "on-demand economy", compared to healthcare, education, or other established industries, is relatively shallow. The only major pieces of academic literature associated with Uber specifically are "An Analysis of the Labor Market for Uber's Driver-Partners in the United States" (Hall 2015) by Hall and Krueger, which looks at the effective labor force of the Uber driver pool, and "Peeking Beneath the Hood of Uber" (Chen 2015) by Chen, Mislove, and Wilson, which tries to "reverse engineer" Uber's surge pricing methodology.

Less traditional academic research on the effects of Uber on traffic has been done at FiveThirtyEight, where articles by Fischer-Baum, Bialik, Silver, Flowers, and Mehta have provided interesting insight into the activities of Uber in New York. The article "Uber Is Taking Millions of Manhattan Rides Away From Taxis" argues that "the ride-share service probably isn't increasing congestion" (Fischer-Baum 2015), coming to the conclusion that rides in the city core are largely Uber replacing taxis, while Brooklyn and Queens are seeing net increases in total pickups. The major caveat of this article, as it specifically relates to traffic congestion, is laid out as such: "it's also possible that many more empty cabs are cruising Manhattan streets than last year, although it would not be financially viable for drivers — who primarily rent their cabs — to keep this up for very long." (Fischer-Baum 2015) If, as the article states, Uber drivers accounted for nearly four million new pickups in the core of Manhattan from 2014–2015 while taxi drivers cruised empty, by definition there should be an increase in congestion, with more cars operating in the same physical space.

The other FiveThirtyEight pieces look at where New York City's Green Cabs — cabs that are explicitly excluded from Manhattan's core — operate (Fischer-Baum 2015), the intersection between Uber's growth and public transit (Silver 2015), and an analysis of Uber's operation in the outer boroughs compared to taxis (Bialik 2015). These pieces are less salient to the discussion and analysis of congestion and associated traffic issues, but they provide important context for Uber's activity in the city. The final FiveThirtyEight article, "The Debate on Uber's Impact in New York City Is Far From Over," focuses on the political machinations around the effects of Uber in New York City, which is at the heart of this paper, as explained in the Introduction and Background sections.
3. Conceptual System Design

3.1 Conceptual System Design - CSD Diagram with the explanation of each module

Fig 1: Conceptual Diagram

1) Gathering data from the Kaggle community: We start by collecting the data from the Kaggle community, where we found the official Uber CSV dataset with 31 million entries.
2) Cleaning and filtering the data: We then clean the data by looking for any missing or potentially erroneous data to minimize the chance of errors during the visualization.
3) Timeframe feature vectors: We then create and prepare the feature vectors for the time ranges of 1 month, 3 months, 6 months, 1 year, 3 years, and 5 years (a sketch of one way to build these follows this list). Finally, we begin with the classifier.
4) Random Forest classification: We use the Random Forest Classifier and cross-validation with the number of trees ranging from 1 to 40, and find out which number of trees gives higher accuracy.
5) Visualization of data: We then visualize the data based on our criteria.
6) Drawing inferences from our analysis: We then analyze the results and draw conclusions from them.
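The report does not fix the exact contents of the timeframe feature vectors, so the sketch below shows one plausible construction: simple aggregates of trips over trailing windows of 1, 3, 6, 12, 36, and 60 months ending at a reference date. The file and column names, and the aggregates themselves, are assumptions.

```python
# Sketch of building per-timeframe feature vectors from a trip-level data frame.
# The aggregates chosen here (counts and means) are illustrative, not definitive.
import pandas as pd

trips = pd.read_csv("uber-trips-2015.csv", parse_dates=["pickup_datetime"])  # hypothetical file

def timeframe_features(frame, end, months):
    """Aggregate simple statistics over the `months` months ending at `end`."""
    start = end - pd.DateOffset(months=months)
    window = frame[(frame["pickup_datetime"] >= start) & (frame["pickup_datetime"] < end)]
    return {
        "timeframe_months": months,
        "trip_count": len(window),
        "mean_distance": window["trip_distance"].mean(),
        "mean_duration": window["trip_duration"].mean(),
    }

end = pd.Timestamp("2015-09-01")
features = pd.DataFrame([timeframe_features(trips, end, m) for m in (1, 3, 6, 12, 36, 60)])
print(features)
```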
3.2 Methodology
3.2.1 Data Gathering / Loading

The source data is collected and stored as CSVs. The data contains features distinct from those in the set previously released and thoroughly explored by FiveThirtyEight and the Kaggle community. For this system we are using the same data, which has 31 million entries with complete information about each ride. This dataset gives us an extensive look at Uber's revenue growth, customer behaviour, etc.

3.2.2 Data Preprocessing, Descriptive Analysis

Some of the data cleaning, preprocessing, and transformation steps that were performed are listed below:

1. There were very few clearly erroneous entries in the dataset and a small proportion of
suspicious cases or anomalies that warrant further internal analysis. These cases are, for
example, those with very long distance traveled, but destination still recorded within New
York City, or those with average speed slower than walking, but very long duration
(beyond a reasonable assumption for the amount of time taken to get out of some really
bad traffic gridlock, or the unlikely situation of a driver left waiting).
2. In addition, there was a small proportion of cases with distance and duration equal to
zero. Do they represent canceled trips? A small subset actually shows distinct origin and
destination zones, indicating that some distance was driven but not recorded. In other
cases, the recorded distance was zero but the trip duration was not, in rarer cases even
exceeding 5 minutes.
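A minimal sketch of how such suspicious cases might be flagged is shown below; the thresholds (for example, a walking speed of about 3 mph) and the column names are assumptions made for illustration, not the exact rules used.

```python
# Sketch: flag implausibly long trips and trips slower than walking despite a long duration.
# Distance is assumed to be in miles and duration in seconds; thresholds are illustrative.
import pandas as pd

trips = pd.read_csv("uber-trips-2015.csv")  # hypothetical file

duration_hours = trips["trip_duration"] / 3600.0
avg_speed_mph = trips["trip_distance"] / duration_hours

very_long_distance = trips["trip_distance"] > 100                 # far beyond a typical NYC trip
slower_than_walking = (avg_speed_mph < 3) & (duration_hours > 1)  # crawling for over an hour
zero_trip = (trips["trip_distance"] == 0) & (trips["trip_duration"] == 0)

trips["suspicious"] = very_long_distance | slower_than_walking
print(trips["suspicious"].sum(), "suspicious trips,",
      zero_trip.sum(), "zero-distance and zero-duration trips")
```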

3.2.3 Filtering

The suspicious and anomalous data points were not changed, but the trips with a duration greater than 16 hours (123 cases out of nearly 31 million, mostly system errors) were removed from the dataset. In addition, the data was censored at exactly 365 days for convenience, which left out only 1,852 cases.

Finally, about 4% of the destination data were missing, and an extremely small number of cases
had missing trip distance and destination. The imputation method chosen for the latter set was
the mean distance and duration of their respective origin-destination pair. The entries with
missing destination were left unchanged, although the information from the vast number of
complete cases could potentially be used to determine the most probable destination.
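A sketch of this filtering and imputation, assuming columns origin_taz, destination_taz, trip_distance, and trip_duration (the exact schema is an assumption), might look like this.

```python
# Sketch of the filtering and imputation steps described above. Column names are
# assumed; duration is taken to be in seconds.
import pandas as pd

trips = pd.read_csv("uber-trips-2015.csv")  # hypothetical file

# Remove trips longer than 16 hours (treated as system errors above).
trips = trips[trips["trip_duration"] <= 16 * 3600]

# Impute missing distance/duration with the mean of the trip's origin-destination pair.
for col in ["trip_distance", "trip_duration"]:
    pair_mean = trips.groupby(["origin_taz", "destination_taz"])[col].transform("mean")
    trips[col] = trips[col].fillna(pair_mean)

# Entries with a missing destination are left unchanged, as described above.
print(trips.isna().sum())
```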

3.2.4 Classification / Clustering


We transform our features into a Pandas data frame for analysis. We separate ~23% of the data as test data and ~77% as training data. We perform 10-fold cross-validation on Random Forest classification with 1 to 40 trees in the forest, on the training data for each individual time frame. We use box plots to visualize the results of the cross-validation and pick an ideal estimator size. We train our Random Forest Classifier with the ideal estimator size on the training data and check its performance on the test data by predicting labels and comparing them with the pre-assigned labels. We also generate feature importance charts from the random forest classification to educate our users about the features to look at for any time horizon.
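The sketch below illustrates this procedure with scikit-learn: a 77/23 train-test split, 10-fold cross-validation for forests of 1 to 40 trees, a box plot of the scores, and feature importances from the final model. The feature matrix X and labels y are placeholders standing in for the timeframe feature vectors and their pre-assigned labels.

```python
# Sketch of the cross-validation and training procedure described above.
# X and y are placeholder data standing in for the real feature vectors and labels.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.23, random_state=0)

# 10-fold cross-validation for forests of 1 to 40 trees.
tree_counts = list(range(1, 41))
cv_scores = [
    cross_val_score(RandomForestClassifier(n_estimators=n, random_state=0),
                    X_train, y_train, cv=10)
    for n in tree_counts
]

# Box plot of the cross-validation scores, used to pick an ideal estimator size.
plt.boxplot(cv_scores, labels=tree_counts)
plt.xlabel("number of trees")
plt.ylabel("cross-validation accuracy")
plt.show()

# Train with the estimator count whose median CV score is highest, then evaluate on test data.
best_n = tree_counts[int(np.argmax([np.median(s) for s in cv_scores]))]
model = RandomForestClassifier(n_estimators=best_n, random_state=0).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
print("feature importances:", model.feature_importances_)
```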

3.2.5 Visualizations

We obtain the following classification scores for each time frame:


1 month: 1.7083799839019775
3 month: 1.882817029953003
6 month: 2.0246660709381104
1 year: 2.2246660709381104
2 year: 2.675666070938110
4 year: 2.786660709381104
5 year: 2.9999660709381104

The scores appear to be very high because of the class imbalance problem. This means that if the
classifier blindly assigned zeros to every data point, it would still produce a good score just
because it correctly labelled bad data points as bad by chance. We can observe this by
calculating the precision, recall and plotting Receiver Operating Characteristics (ROC curves).
The problem can be solved in two ways - either by reducing the number of bad samples (not
recommended for this particular scenario) or by increasing the good samples (which can be done
by duplicating the good samples). The model trained after making these changes would perform
better on unseen samples. We would like to solve the Class Imbalance problem and train a better
model as a future improvement for this product.
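A minimal sketch of these checks and of the naive duplication remedy mentioned above is given below, using placeholder data; scikit-learn's metric functions compute precision, recall, and ROC AUC, and the oversampling simply repeats the minority-class training samples.

```python
# Sketch: diagnose class imbalance with precision/recall and ROC AUC, then duplicate
# the minority-class training samples before retraining. Placeholder data throughout.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.23, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)
print("precision:", precision_score(y_test, pred),
      "recall:", recall_score(y_test, pred),
      "ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Naive oversampling: duplicate the minority class until the classes are roughly balanced.
minority = y_train == 1
n_copies = int((~minority).sum() // minority.sum())
X_bal = np.vstack([X_train] + [X_train[minority]] * (n_copies - 1))
y_bal = np.concatenate([y_train] + [y_train[minority]] * (n_copies - 1))

clf_bal = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_bal, y_bal)
print("recall after oversampling:", recall_score(y_test, clf_bal.predict(X_test)))
```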
4. Technology Used

1. Python 2.7 (programming language) with the following modules:

a. Numpy
b. Pandas
c. Matplotlib
d. Seaborn

2. Google Colab notebook for implementation

5. Implementation

1) We first start by collecting the data from the Kaggle community, where we found the official Uber CSV dataset with 31 million entries.
2) We then filter the data by searching for missing/erroneous data to minimize any errors during the training process.
3) We then use a Risk Factor scale of 1 to 5, with 1 being High Risk, 2 being Moderately High Risk, 3 being Moderate Risk, 4 being Moderately Low Risk, and 5 being Low Risk.
4) After processing the data for the machine learning step, we pre-select the entries with a risk factor of 4.
5) We then create and prepare the feature vectors for the time ranges of 1 month, 3 months, 6 months, 1 year, 3 years, and 5 years. Finally, we begin with the classifier.
6) We use the Random Forest Classifier and cross-validation with the number of trees ranging from 1 to 40, and find out which number of trees gives higher accuracy.
7) We then pick the tree count with optimal performance on the validation data. For the 6-month feature importance tree, we take n_estimators = 2.
8) We use the models with maximum F1 scores and then output the base revenue of Uber in its first year of operation and the monthly growth of the company (a sketch of this selection step follows the list).
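As a rough sketch of the model-selection step in point 8, assuming one trained classifier per time frame and a shared test set (all names below are placeholders), the best model could be picked by F1 score like this.

```python
# Sketch: pick the classifier with the maximum F1 score on the test data.
# `models` maps a time frame label to a fitted classifier; X_test / y_test are placeholders.
from sklearn.metrics import f1_score

def best_model_by_f1(models, X_test, y_test):
    """Return the (time_frame, model) pair with the highest F1 score, plus all scores."""
    scores = {frame: f1_score(y_test, model.predict(X_test))
              for frame, model in models.items()}
    best_frame = max(scores, key=scores.get)
    return best_frame, models[best_frame], scores

# Usage, with models/X_test/y_test prepared as in the earlier sketches:
# frame, model, scores = best_model_by_f1(models, X_test, y_test)
# print("best time frame:", frame, "F1 scores:", scores)
```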
6. Results and Conclusion

Uber launched in NYC in May of 2011, the first city outside of its San Francisco headquarters.
The number of Uber trips per day in NYC is still growing significantly. In 2017 so far, this
number has often surpassed 200,000, but the plot below shows that by mid-2015 it was hovering
around 120,000.

Another interesting insight from the plot above is the effect of major events on the number of trips. For the period of time analyzed, negative impacts are related to Thanksgiving, Christmas, Memorial Day, and Independence Day. In addition, an apparently odd and very significant drop in the number of trips is shown on January 27th. This was the result of a curfew imposed by NYC's mayor in preparation for a blizzard. On the other hand, the plot also highlights which events positively impacted the number of trips that year, with the International Marathon and the Gay Pride Week standing out as the strongest contributors.

The data also allows us to visualize other interesting trends over time. In the bar charts below,
we can see that the demand for Uber is higher from 4 PM until around midnight. Saturday has
the highest demand. Interestingly, Sunday shows a level of demand similar to Wednesday, which
is higher than Monday or Tuesday.
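The kind of bar charts referred to here can be reproduced with a short script like the one below; the file and column names are assumptions.

```python
# Sketch of the demand-by-hour and demand-by-weekday bar charts described above.
import matplotlib.pyplot as plt
import pandas as pd

trips = pd.read_csv("uber-trips-2015.csv", parse_dates=["pickup_datetime"])  # hypothetical file

fig, (ax_hour, ax_day) = plt.subplots(1, 2, figsize=(12, 4))

trips["pickup_datetime"].dt.hour.value_counts().sort_index().plot.bar(ax=ax_hour)
ax_hour.set_xlabel("hour of day")
ax_hour.set_ylabel("number of trips")

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
trips["pickup_datetime"].dt.day_name().value_counts().reindex(day_order).plot.bar(ax=ax_day)
ax_day.set_xlabel("day of week")

plt.tight_layout()
plt.show()
```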
In the dataset, the locations have been anonymized, but it's reasonable to assume that the top origin codes are based in Manhattan. In that case, the top destination codes are also based in Manhattan, because they overlap with the top origin codes, as can be seen in the plot below.
The relation between a trip's duration and distance is not entirely linear. Rather, it approximates
to a power function because shorter trips, occurring mostly within busy areas of traffic, tend to
result in lower average trip speed.
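One simple way to check this claim is to fit the relationship duration ≈ a · distance^b as a straight line in log-log space, as sketched below (file and column names assumed).

```python
# Sketch: fit duration ≈ a * distance^b by linear regression in log-log space.
import numpy as np
import pandas as pd

trips = pd.read_csv("uber-trips-2015.csv")  # hypothetical file

valid = (trips["trip_distance"] > 0) & (trips["trip_duration"] > 0)
log_dist = np.log(trips.loc[valid, "trip_distance"])
log_dur = np.log(trips.loc[valid, "trip_duration"])

b, log_a = np.polyfit(log_dist, log_dur, deg=1)   # slope b is the power-law exponent
print("duration ~ %.2f * distance^%.2f" % (np.exp(log_a), b))
```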

For the first time, it's possible to estimate Uber's revenue in NYC with more granularity due to
the availability of each trip's duration and distance in the dataset.

However, the revenue figures are described as "base revenue" because other critical information is missing. Uber offers different types of services with distinct prices, namely UberX, UberXL, Uber Black, Uber SUV, and UberPool. Except for the latter, all other services carry a higher fare than UberX. Moreover, Uber practices surge pricing, which affects the revenue positively. We chose to use UberX published fares to calculate the revenue, as this is probably the most popular product. Therefore, the base revenue is a conservative estimate of the actual revenue. Indeed, the mean revenue per trip between September 2014 and August 2015, calculated from the data by assuming all trips were UberX, was $19. Comparatively, Uber has published that the average NYC UberX fare was $27 in September 2014. The chart below shows the estimated base revenue growth for each month:
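The base-revenue figures behind that chart could be computed along the following lines; the fare components below are placeholders, and the published NYC UberX rates for the period should be substituted.

```python
# Sketch of estimating base revenue per trip from distance and duration, using
# HYPOTHETICAL UberX-style fare components (base + per-minute + per-mile + minimum fare).
# Assumes trip_duration is in seconds and trip_distance in miles.
import pandas as pd

trips = pd.read_csv("uber-trips-2015.csv", parse_dates=["pickup_datetime"])  # hypothetical file

BASE_FARE = 3.00     # $ per trip       -- placeholder, not the published rate
PER_MINUTE = 0.40    # $ per minute     -- placeholder
PER_MILE = 2.15      # $ per mile       -- placeholder
MINIMUM_FARE = 8.00  # $ minimum charge -- placeholder

fare = (BASE_FARE
        + PER_MINUTE * trips["trip_duration"] / 60.0
        + PER_MILE * trips["trip_distance"])
trips["base_revenue"] = fare.clip(lower=MINIMUM_FARE)

monthly = trips.set_index("pickup_datetime")["base_revenue"].resample("M").sum()
print(monthly)                       # estimated base revenue per month
print(trips["base_revenue"].mean())  # mean base revenue per trip (compare with the $19 above)
```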
It's important to note that of the gross estimated revenue, Uber's share is about 25% of the total. Therefore, we can conservatively estimate that Uber's gross margin in NYC from September 2014 to August 2015 was on the order of $150 million (roughly 31 million trips × $19 mean base revenue per trip × 25%). The estimated gross margin, considering instead the $27 average fare previously mentioned, was on the order of $210 million. Not bad! The next chart illustrates the percentage growth in revenue, month-over-month, from September 2014:
