
# Working With The Divvy Dataset

Pratik Agrawal

Introduction
Over the past couple of years Divvy has organized data challenges to encourage
innovation in the Chicago data science community, as well as to learn new ways to visualize
and manage the bike rental system.

Problem
Divvy vans constantly ferry bikes from station to station to correct a shortage
or surplus of bikes at a given location. This movement of bikes is labor and time intensive,
and both are high costs that Divvy has to bear. Being able to predict the
volume of rentals would allow for more precise scheduling.
In this project I have decided to work with daily rental volume (total rides) as my target
variable. As this is a supervised learning problem, the techniques used are
as follows:
a) Lasso Regression
b) Ridge Regression
c) Elastic Net

Data Sets
a) Divvy data set 2015 Q1 & Q2
b) Route information data: in order to enrich the data set with more information, I decided to
include distance information (route calculation from the HERE.com Route Calculation API) for
each origin/destination pair in the dataset.
c) Weather data: weather data from Wunderground.com was downloaded for the period
covered by the Divvy data set.

In [1]: import pandas as pd
import matplotlib.pyplot as plt
import gc
import seaborn as sns
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]: import warnings
warnings.filterwarnings('ignore')

The Dataset
1. Lets read the readme.txt file supplied with the dataset, and see which features are
included in this data set.
Even though the file is for the 2013 dataset, the columns have not changed much in the
current year.

if line!=None:
    print line

This file contains metadata for both the Trips and Stations table.

datachallenge (http://DivvyBikes.com/datachallenge) or email questions to data@DivvyBikes.com.

Variables:

trip_id: ID attached to each trip taken
starttime: day and time trip started, in CST
stoptime: day and time trip ended, in CST
bikeid: ID attached to each bike
tripduration: time of trip in seconds
from_station_name: name of station where trip originated
to_station_name: name of station where trip terminated
from_station_id: ID of station where trip originated
to_station_id: ID of station where trip terminated
usertype: "Customer" is a rider who purchased a 24-Hour Pass; "Subscriber" is a rider who purchased an Annual Membership
gender: gender of rider
birthyear: birth year of rider

Notes:

* First row contains column names

* Total records = 759,789
* Trips that did not include a start or end date were removed from original table.
* Gender and birthday are only available for Subscribers

Metadata for Stations table:

Variables:

name: station name
latitude: station latitude
longitude: station longitude
dpcapacity: number of total docks at each station as of 2/7/2014
online date: date the station went live in the system

From the above information we have a good idea about what the dataset looks like. The meta data provided is actually very useful, since that is another feature missing from datasets. Under normal circumstances such a clean set is hard to come by.

2. Read the files

In [3]: df_trips_1=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q
df_trips_2=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q
df_stations=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Stations_20
df_trips = pd.concat([df_trips_1,df_trips_2])

Lets take a quick peek at the head for each data frame

In [5]: df_trips.head()
Out[5]:
   trip_id  starttime        stoptime       bikeid  tripduration  from_station_id  from_station_name
0  4738454  3/31/2015 23:58  4/1/2015 0:03  1095    299           117              Wilton Ave & Belmont Ave
1  4738450  3/31/2015 23:59  4/1/2015 0:15  537     940           43               Michigan Ave & Washington St
2  4738449  3/31/2015 23:59  4/1/2015 0:11  2350    751           162              Damen Ave & Wellington Ave
3  4738448  3/31/2015 23:59  4/1/2015 0:19  938     1240          51               Clark St & Randolph St
4  4738445  3/31/2015 23:54  4/1/2015 0:15  379     1292          134              Peoria St & Jackson Blvd

In [6]: df_stations.head()
Out[6]:
   id  name                      latitude   longitude   dpcapacity  landmark
0  2   Michigan Ave & Balbo Ave  41.872293  -87.624091  35          541
1  3   Shedd Aquarium            41.867226  -87.615355  31          544
2  4   Burnham Harbor            41.856268  -87.613348  23          545
3  5   State St & Harrison St    41.874053  -87.627716  23          30
4  6   Dusable Harbor            41.885042  -87.612795  31          548

The dataframes above can be joined on from_station_id/to_station_id and id.
Lets look at the shape of the dataframes

In [7]: df_trips.shape
Out[7]: (1096239, 12)

In [8]: df_stations.shape
Out[8]: (474, 6)

Joining the origin station id with the data from the stations data frame

In [4]: df_from=pd.merge(df_trips,df_stations,left_on="from_station_id",right_on

In [12]: df_from.shape
Out[12]: (1096239, 18)

In [13]: df_from.head()
1  4738431  3/31/2015 23:42  3/31/2015 23:47  68    260  117  Wilton Ave & Belmont Ave
2  4738386  3/31/2015 23:04  3/31/2015 23:07  422   186  117  Wilton Ave & Belmont Ave
3  4738303  3/31/2015 22:19  3/31/2015 22:22  1672  145  117  Wilton Ave & Belmont Ave
4  4738089  3/31/2015 21:07  3/31/2015 21:10  2720  200  117  Wilton Ave & Belmont Ave

Joining the destination station id with the data from the stations data frame

In [5]: df_divvy=pd.merge(df_from,df_stations,left_on="to_station_id",right_on=

In [15]: df_divvy.shape
Out[15]: (1096239, 24)
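The two joins above can be illustrated on a toy example. This is a minimal sketch, not the notebook's code: the tiny `trips` and `stations` frames and their values are made up for illustration, and only the column names mirror the Divvy files. It shows why the merged frame ends up with `latitude_x`/`longitude_x` (origin) and `latitude_y`/`longitude_y` (destination) columns.

```python
import pandas as pd

# Toy stand-ins for df_trips and df_stations (illustrative values only).
trips = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "from_station_id": [2, 3, 2],
    "to_station_id": [3, 2, 4],
})
stations = pd.DataFrame({
    "id": [2, 3, 4],
    "name": ["A", "B", "C"],
    "latitude": [41.87, 41.86, 41.85],
    "longitude": [-87.62, -87.61, -87.60],
})

# First join: attach the origin station's attributes to each trip.
df_from = pd.merge(trips, stations, left_on="from_station_id", right_on="id")

# Second join: attach the destination station's attributes.
# Column names that now clash get pandas' default suffixes:
# _x for the left (origin) side and _y for the right (destination) side.
df_both = pd.merge(df_from, stations, left_on="to_station_id", right_on="id")

print(df_both[["trip_id", "latitude_x", "longitude_x", "latitude_y", "longitude_y"]])
```

Because every trip matches exactly one origin and one destination station, the row count stays the same after both joins, which is what the unchanged first element of the shapes above reflects.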

In [16]: df_divvy.tail()
Out[16]:
         trip_id  starttime        stoptime         bikeid  tripduration  from_station_id  from_station_n
1096234  5348427  5/27/2015 7:04   5/27/2015 7:21   2817    1023          428              Dorchester Ave 63rd St
1096235  5338209  5/26/2015 10:38  5/26/2015 10:53  2819    912           428              Dorchester Ave 63rd St
1096236  5670422  6/16/2015 18:01  6/16/2015 18:16  3113    869           95               Stony Island Ave 64th St
1096237  5375075  5/28/2015 15:49  5/28/2015 16:04  2004    892           391              Halsted St & 69t
1096238  5611858  6/13/2015 9:36   6/13/2015 9:42   4703    374           388              Halsted St & 63r
5 rows × 24 columns

Lets try a sample call to the HERE maps api

In [16]: from urllib2 import urlopen
from StringIO import StringIO
import simplejson

To use the here.com api, one has to register as a developer, and usage is limited to 100K calls/month.
For security purposes the application id and code for my dev user have not been included in the api call made below.

In [17]: url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute.jso

In [18]: json_array = simplejson.loads(url)

Lets take a look at what the HERE Calculate Route API response looks like

In [19]: json_array
Out[19]:
{u'response': {u'language': u'en-us',
  u'metaInfo': {..., u'timestamp': u'2015-12-08T21:37:18Z'},
  u'route': [{u'leg': [{u'start': {u'label': u'E Lake Shore Dr', u'mappedRoadName': u'E Lake Shore Dr', u'shapeIndex': 0, u'sideOfStreet': u'right', u'type': u'stopOver', ...},
     u'end': {u'label': u'W Menomonee St', u'mappedRoadName': u'W Menomonee St', u'shapeIndex': 60, u'sideOfStreet': u'left', u'type': u'stopOver', ...},
     u'length': 3620,
     u'maneuver': [... eight entries of u'_type': u'PrivateTransportManeuverType', u'id': u'M1' to u'M8' ...],
     u'travelTime': 431}],
    u'mode': {u'feature': [], u'trafficMode': u'disabled', u'transportModes': [u'car'], u'type': u'fastest'},
    u'summary': {u'_type': u'RouteSummaryType',
     u'baseTime': 431,
     u'distance': 3620,
     u'flags': [u'park'],
     u'text': u'The trip takes <span class="length">3.6 km</span> and <span class="time">7 mins</span>.',
     u'trafficTime': 431,
     u'travelTime': 431},
    u'waypoint': [{u'label': u'E Lake Shore Dr', ...}, {u'label': u'W Menomonee St', ...}]}]}}

To access the distance between the two points provided in the API request, we can look at the summary section of the JSON object

In [22]: print json_array['response']['route'][0]['summary']['distance']
3620

Similarly we can access other parameters such as base time and traffic time (both have been provided for vehicle based routing). This API however does not provide estimates as to how traffic affects the bicycle times.

In [23]: print "base_time: ",json_array['response']['route'][0]['summary']['baseTime']
print "traffic_time: ",json_array['response']['route'][0]['summary']['trafficTime']
base_time: 431
traffic_time: 431
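The lookups above all walk the same path into the parsed response. As a minimal sketch, assuming only the nesting visible in the output above (`response` → `route` → first element → `summary`), they can be wrapped in one small helper. The function name `route_summary` and the hand-built `sample` dictionary are illustrative, not part of the notebook or of the HERE API.

```python
def route_summary(response):
    """Return (distance, base_time, traffic_time) from a parsed route response.

    Assumes the nesting shown in the notebook output:
    response -> route -> first route -> summary.
    """
    summary = response["response"]["route"][0]["summary"]
    return summary["distance"], summary["baseTime"], summary["trafficTime"]


# Hand-built stand-in for a parsed API response, using the values printed above.
sample = {
    "response": {
        "route": [
            {"summary": {"distance": 3620, "baseTime": 431, "trafficTime": 431}}
        ]
    }
}

distance_m, base_s, traffic_s = route_summary(sample)
print(distance_m, base_s, traffic_s)
```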

Lets create a function to query the HERE.com Calculate Route API for any two locations, and also test this with the first two rows of the data set

In [24]: def calc_dist_time(x):
    url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute
    json_array = simplejson.loads(url)
    base_time=json_array['response']['route'][0]['summary']['baseTime']
    traffic_time=json_array['response']['route'][0]['summary']['trafficTime
    distance=json_array['response']['route'][0]['summary']['distance']
    return pd.Series({'base_time':base_time,
                      'distance':distance,
                      'traffic_time':traffic_time,
                      'json_array':json_array})
df_dist=df_divvy.head(2).apply(calc_dist_time,axis=1)

In [25]: df_dist
Out[25]:
   base_time  distance  json_array                                         traffic_time
0  208        923       {u'response': {u'route': [{u'leg': [{u'start':...  208
1  208        923       {u'response': {u'route': [{u'leg': [{u'start':...  208

In [26]: df_dist.json_array[0]['response']['route'][0]['summary']
Out[26]: {u'_type': u'RouteSummaryType',
 u'baseTime': 208,
 u'distance': 923,
 u'text': u'The trip takes <span class="length">923 m</span> and <span class="time">3 mins</span>.',
 u'trafficTime': 208,
 u'travelTime': 208}

Now lets do a simple reduction in the number of calls made to the HERE.com API

In [29]: df_temp=df_divvy.drop_duplicates(["latitude_x","longitude_x","latitude_y"

In [30]: df_temp
Out[30]:
     trip_id  starttime        stoptime         bikeid  tripduration  from_station_id  from_station_na
0    4738454  3/31/2015 23:58  4/1/2015 0:03    1095    299           117              Wilton Ave & Belmont Ave
181  4447991  1/17/2015 15:26  1/17/2015 15:57  645     1859          43               Michigan Ave & Washington St
184  4631588  3/14/2015 18:20  3/14/2015 18:38  1226    1103          162              Damen Ave & Wellington Ave
192  4735646  3/31/2015 17:16  3/31/2015 17:37  1312    1296          51               Clark St & Rand St
...

As can be seen from above, the number of calls that will need to be made to the HERE.com API is 65K, which is well below the monthly quota. This can be further reduced by removing the duplicates between x-y and y-x combinations of the locations.

Lets run the query for each combination of location in this reduced dataset.
I already ran the code below prior to forming this notebook, and had saved the results of the queries. Hence you will not see execution numbers for some of the code blocks.

Note: I tried Google Maps API (only a few thousand free calls, and throttled/denied thereafter), as well as Open Street Maps API, and only found HERE.com API to be the most responsive, and best in class in terms of quota.

In [ ]: df_dist=df_temp.apply(calc_dist_time,axis=1)

In [33]: df_dist_matrix = df_divvy[["latitude_x","longitude_x","latitude_y","longitu

In [66]: df_dist_time = pd.merge(df_dist_matrix,df_dist,left_on="ix",right_on="ix"

In [76]: df_dist_time["key"] = "%s_%s_%s_%s"%(str(df_dist_time.latitude_x),
                                     str(df_dist_time.longitude_x),
                                     str(df_dist_time.latitude_y),
                                     str(df_dist_time.longitude_y))

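Two points about the call-reduction step are worth illustrating. First, calling `str()` on a whole pandas column returns the printed form of the entire column, so a `"%s_%s_%s_%s"` key built that way is a single string shared by every row rather than one key per row. Second, the further reduction mentioned above (treating x-y and y-x as the same pair) can be done with an order-independent pair. The following is a minimal sketch on made-up coordinates; the helper `unordered_pair` and the frame names are hypothetical, and treating both directions as one route is an approximation, since one-way streets can make the two directions differ.

```python
import pandas as pd

coords = ["latitude_x", "longitude_x", "latitude_y", "longitude_y"]

# Toy trips: A->B twice, B->A once, and A->A once (illustrative values only).
trips = pd.DataFrame({
    "latitude_x": [41.90, 41.90, 41.91, 41.90],
    "longitude_x": [-87.62, -87.62, -87.64, -87.62],
    "latitude_y": [41.91, 41.91, 41.90, 41.90],
    "longitude_y": [-87.64, -87.64, -87.62, -87.62],
})

# One row per ordered origin/destination pair.
pairs = trips.drop_duplicates(coords).copy()

# Row-wise key: convert each value to text, then join the four parts per row.
pairs["key"] = pairs[coords].astype(str).apply("_".join, axis=1)


def unordered_pair(row):
    """Return the two endpoints in a fixed order, so A->B equals B->A."""
    origin = (row["latitude_x"], row["longitude_x"])
    destination = (row["latitude_y"], row["longitude_y"])
    return tuple(sorted([origin, destination]))


# Keep only the first occurrence of each unordered pair.
one_way = pairs[~pairs.apply(unordered_pair, axis=1).duplicated()]

print(len(trips), len(pairs), len(one_way))
```

Merging on the four coordinate columns directly, as the notebook does later, avoids the need for a string key altogether.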
In [80]: df_divvy["key"] = "%s_%s_%s_%s"%(str(df_divvy.latitude_x),
                                 str(df_divvy.longitude_x),
                                 str(df_divvy.latitude_y),
                                 str(df_divvy.longitude_y))

In [89]: df_dist_time.to_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_time.csv

In [17]: df_dist_time = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_t

In [18]: list(df_dist_time.columns)
Out[18]: ['Unnamed: 0',
 'latitude_x',
 'longitude_x',
 'latitude_y',
 'longitude_y',
 'ix',
 'base_time',
 'distance',
 'json_array',
 'traffic_time']

In [19]: df_divvy = pd.merge(df_divvy,df_dist_time,left_on=["latitude_x","longitude_

In [20]: df_divvy.drop(["ix","json_array"],axis=1,inplace=True)

In [39]: len(list(df_divvy.columns))
Out[39]: 28

Lets save this data set

In [102]: df_divvy.to_csv('../../data/Divvy_Trips_2015-Q1Q2/complete-data.csv')

We can free up the memory by forcing garbage collection. I've done this as there is a lot of data held in memory, and there is no further use for it.

In [8]: df_dist=[]
df_dist_matrix=[]
df_dist_time=[]
df_from=[]
df_trips=[]
df_trips_1=[]
df_trips_2=[]
df_divvy_all=[]
import gc
gc.collect()
Out[8]: 114

Lets also download weather information for each day of Q1 & Q2 2015. For this purpose I downloaded the weather history from wunderground.com

Note: Code execution resumes from here, as code above requires a dev account to make calls to HERE.com

In [22]: weather = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/CustomWeather.csv'
weather.head()
Out[22]:
   CDT     Max TemperatureF  Mean TemperatureF  Min TemperatureF  Max Dew PointF  MeanDew PointF  Min DewpointF
0  1/1/15  32                25                 17                16              11              4
1  1/2/15  36                28                 20                22              19              15
2  1/3/15  37                34                 31                36              32              22
3  1/4/15  36                21                 5                 35              22              -5
4  1/5/15  10                5                  -1                5               -3              -10
5 rows × 23 columns

In [23]: list(weather.columns)
Out[23]: ['CDT',
 'Max TemperatureF',
 'Mean TemperatureF',
 'Min TemperatureF',
 'Max Dew PointF',
 'MeanDew PointF',
 'Min DewpointF',
 'Max Humidity',
 ' Mean Humidity',
 ' Min Humidity',
 ' Max Sea Level PressureIn',
 ' Mean Sea Level PressureIn',
 ' Min Sea Level PressureIn',
 ' Max VisibilityMiles',
 ' Mean VisibilityMiles',
 ' Min VisibilityMiles',
 ' Max Wind SpeedMPH',
 ' Mean Wind SpeedMPH',
 ' Max Gust SpeedMPH',
 'PrecipitationIn',
 ' CloudCover',
 ' Events',
 ' WindDirDegrees']

Lets clean the column names, and get rid of the leading white space

In [24]: weather.columns=[c.strip(" ") for c in weather.columns]

In [25]: list(weather.columns)
Out[25]: ['CDT',
 'Max TemperatureF',
 'Mean TemperatureF',
 'Min TemperatureF',
 'Max Dew PointF',
 'MeanDew PointF',
 'Min DewpointF',
 'Max Humidity',
 'Mean Humidity',
 'Min Humidity',
 'Max Sea Level PressureIn',
 'Mean Sea Level PressureIn',
 'Min Sea Level PressureIn',
 'Max VisibilityMiles',
 'Mean VisibilityMiles',
 'Min VisibilityMiles',
 'Max Wind SpeedMPH',
 'Mean Wind SpeedMPH',
 'Max Gust SpeedMPH',
 'PrecipitationIn',
 'CloudCover',
 'Events',
 'WindDirDegrees']

Lets convert the date feature of both df_divvy and weather data to pandas datetime objects.

In [26]: df_divvy.drop("Unnamed: 0",axis=1,inplace=True)

In [27]: df_divvy["date"]=df_divvy.starttime.apply(lambda x: x.split(" ")[0])

In [28]: df_divvy["date"]=pd.to_datetime(df_divvy.date)

In [29]: df_divvy.head()
Out[29]:
   trip_id  starttime        stoptime         bikeid  tripduration  from_station_id  from_station_name
0  4738454  3/31/2015 23:58  4/1/2015 0:03    1095    299           117              Wilton Ave & Belmont Ave
1  4731216  3/31/2015 8:03   3/31/2015 8:08   719     313           117              Wilton Ave & Belmont Ave
2  4729848  3/30/2015 21:22  3/30/2015 21:27  168     310           117              Wilton Ave & Belmont Ave
3  4729672  3/30/2015 20:42  3/30/2015 20:51  2473    595           117              Wilton Ave & Belmont Ave
4  4715390  3/27/2015 21:26  3/27/2015 21:31  1614    312           117              Wilton Ave & Belmont Ave
5 rows × 28 columns

In [30]: weather["CDT"]=pd.to_datetime(weather.CDT)

In [31]: weather.head()
Out[31]:
   CDT         Max TemperatureF  Mean TemperatureF  Min TemperatureF  Max Dew PointF  MeanDew PointF  Min DewpointF
0  2015-01-01  32                25                 17                16              11              4
1  2015-01-02  36                28                 20                22              19              15
2  2015-01-03  37                34                 31                36              32              22
3  2015-01-04  36                21                 5                 35              22              -5
4  2015-01-05  10                5                  -1                5               -3              -10
5 rows × 23 columns
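The cleaning steps above (stripping column names, parsing dates, and making precipitation numeric) can be sketched end to end on a few made-up rows. This is an illustration under assumptions, not the notebook's code: the toy values are invented, the `"T"` entry stands in for a non-numeric precipitation value, and `pd.to_numeric` is used as one possible way to coerce such values.

```python
import pandas as pd

# Toy weather rows; note the leading space in one column name.
weather = pd.DataFrame({
    "CDT": ["1/1/15", "1/2/15"],
    " Mean Humidity": [70, 65],
    "PrecipitationIn": ["0.00", "T"],  # "T" is an assumed non-numeric marker
})

# Strip surrounding whitespace from every column name.
weather.columns = [c.strip() for c in weather.columns]

# Parse the date column with an explicit format (month/day/two-digit year).
weather["CDT"] = pd.to_datetime(weather["CDT"], format="%m/%d/%y")

# Coerce precipitation to numbers; values that cannot be parsed become NaN.
weather["PrecipitationIn"] = pd.to_numeric(weather["PrecipitationIn"], errors="coerce")

# Toy trips: keep only the date part of the start time, then parse it.
trips = pd.DataFrame({"starttime": ["1/1/2015 23:58", "1/2/2015 0:03"]})
trips["date"] = pd.to_datetime(trips["starttime"].str.split(" ").str[0], format="%m/%d/%Y")

print(weather.dtypes)
print(trips)
```

Having both `date` and `CDT` as datetime values is what allows the daily roll-up to be joined to the weather table later on.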

In [32]: weather.PrecipitationIn=weather.PrecipitationIn.convert_objects(convert_num

Analysis
EDA
Now that we have all the data in order, lets take a look at where these stations are located on the map. We will also plot a random sampling of the user types (subscribers v/s customers) and the stations they travel between

In [6]: from IPython.display import HTML
import folium

def inline_map(map):
    """
    Embeds the HTML source of the map directly into the IPython notebook.

    This method will not work if the map depends on any files (json data). Also this uses
    the HTML5 srcdoc attribute, which may not be supported in all browsers.
    """
    map._build_map()
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510

def embed_map(map, path="map.html"):
    """
    Embeds a linked iframe to the map into the IPython notebook.

    Note: this method will not capture the source of the map into the noteb
    This method should work for all maps (as long as they use relative urls
    """
    map.create_map(path=path)
    return HTML('<iframe src="files/{path}" style="width: 100%; height: 510p

In [7]: map_osm = folium.Map(location=[41.9065732,-87.7142335], tiles='Stamen Toner'
for i in range(0,df_stations.shape[0]):
    map_osm.circle_marker(location=[df_stations.latitude[i], df_stations
                          fill_color='blue')
np.random.seed(123)
numbers = np.arange(1,1000000)
np.random.shuffle(numbers)
for i in range(1,10000):
    if(df_divvy.usertype[numbers[i]]=="Subscriber"):
        map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x
    else:
        map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x
inline_map(map_osm)
Out[7]:

As can be seen from the above map-

a) Subscribers are marked with the red lines.
b) Customers are marked with green lines.
c) Subscribers tend to use this service more as a daily commute option, versus customers who use this for shorter distances.
d) Customers tend to ride the bikes in the more tourist-y areas (Lake Shore Trail, Loop, Millennium Park).
e) The bike stations on the periphery of the map see the least traffic.

Note: The above map is interactive, so you should be able to zoom in/out and pan throughout.

1. Lets look at the distances travelled by usertype (Customer v/s Subscriber)

In [35]: fig = plt.figure()
fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy[df_divvy.usertype=="Subscriber"]
df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.hist
plt.ylim(0,25000)
ax.set_ylabel("Distance in meters")
ax.set_title("Subscriber",fontsize=14)
ax = plt.subplot("212")
df_temp = df_divvy[df_divvy.usertype=="Customer"]
df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.hist
ax.set_title("Customer",fontsize=14)
ax.set_ylabel("Distance in meters")
plt.ylim(0,25000)
plt.xlabel("Rental Volume")
plt.show()

It is clear from the above plots that the subscribers in general ride longer distances, as well as contribute to the majority of bike rentals. However, the customers (or tourists/one-time riders) also contribute a significant number of rides. Within the customers, we can think of the riders as:
1. tourists, who rent bikes on weekends and Thursdays.
2. daily riders, who do not have an active subscription and ride these bikes Monday through Wednesday.

2. Lets look at who are the most active bike renters in the subscribers category-

In [36]: def explore(x):
    return pd.Series({"Subscriber":np.sum((x.usertype=="Subscriber")).astyp
                      "Customer":np.sum((x.usertype=="Customer")).astype
df_birthyear_agg=df_divvy.groupby("birthyear").apply(explore)

In [37]: df_birthyear_agg.reset_index(inplace=True)

In [38]: plot(df_birthyear_agg.birthyear,df_birthyear_agg.Subscriber)
plt.show()

In [39]: df_birthyear_agg.sort(["Subscriber"],ascending=False).head(10)
Out[39]:
    birthyear  Customer  Subscriber
61  1986       0         45352
63  1988       2         44295
62  1987       0         42418
60  1985       13        41319
59  1984       0         40523
64  1989       0         39337
58  1983       0         36683
57  1982       0         32490
65  1990       0         32280
56  1981       0         29144

From the above graph and table we see that the millennials are the largest group of subscribers.
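The same birth-year table can be produced with a cross-tabulation. This is a minimal sketch on invented rows, not the notebook's method; it uses `pd.crosstab` and `sort_values`, the latter because `DataFrame.sort` was removed from later pandas releases.

```python
import pandas as pd

# Toy riders (illustrative values only).
riders = pd.DataFrame({
    "birthyear": [1986, 1986, 1988, 1988, 1988, 1988],
    "usertype": ["Subscriber", "Subscriber", "Subscriber",
                 "Customer", "Subscriber", "Subscriber"],
})

# Count rides per birth year, with one column per user type.
counts = pd.crosstab(riders["birthyear"], riders["usertype"])

# Birth years ordered by number of subscriber rides, largest first.
top = counts.sort_values("Subscriber", ascending=False)
print(top)
```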

One can also note that there are a few subscribers with the age of 100 and over. It would seem that these subscribers have not reported their correct age, or if they have, then they are in the pink of health.

3. Lets now look at how the weather affects bike rental volumes. For this purpose we will roll up bike rentals to the day.

a) First we will take a look at the mean temperature and total ridership

Here we will create a few new features-
a) total rides: the total number of rentals for the day
b) average trip duration for the day (in seconds)
c) average trip distance for the day (in meters)
d) birth_year_diff_86 - the difference in birth year from 1986. This is based on the preceding analysis.

In [40]: def roll_up(x):
    return pd.Series({"total_rides":np.count_nonzero(x),
                      "male":np.count_nonzero(x.gender=="Male"),
                      "female":np.count_nonzero(x.gender=="Female"),
                      "avg_trip_duration_s":np.mean(x.tripduration),
                      "avg_distance_m":np.mean(x.distance),
                      "birth_year_diff_86":np.mean(1986-x.birthyear)})
df_divvy_group=df_divvy.groupby(["usertype","date"]).apply(roll_up)

In [41]: df_divvy_group.reset_index(inplace=True)

In [42]: df_divvy_group = pd.merge(df_divvy_group,weather,left_on="date", right_on
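The daily roll-up described above can also be expressed with named aggregations. This is a sketch under assumptions rather than the notebook's implementation: the toy rows are invented, `total_rides` is taken here to be the number of trip rows per user type and day, and the named-aggregation syntax needs a reasonably recent pandas.

```python
import pandas as pd

# Toy trips for a single day (illustrative values only).
trips = pd.DataFrame({
    "usertype": ["Subscriber", "Subscriber", "Customer"],
    "date": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-01"]),
    "trip_id": [1, 2, 3],
    "tripduration": [300, 500, 1200],
    "distance": [900, 1100, 2500],
    "birthyear": [1986.0, 1980.0, float("nan")],
})

# Per-trip difference between 1986 and the rider's birth year.
trips["birth_year_diff_86"] = 1986 - trips["birthyear"]

# One row per (usertype, date) with the features listed above.
daily = (
    trips.groupby(["usertype", "date"])
    .agg(
        total_rides=("trip_id", "size"),
        avg_trip_duration_s=("tripduration", "mean"),
        avg_distance_m=("distance", "mean"),
        birth_year_diff_86=("birth_year_diff_86", "mean"),
    )
    .reset_index()
)
print(daily)
```

Customers have no birth year on record, so their `birth_year_diff_86` comes out as missing, which is consistent with the notebook dropping that column for the Customer models later on.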

In [43]: fig = plt.figure()
fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="blue")
plt.ylim(0,20)
ax.set_ylabel("Mean Temperature")
ax.set_title("Subscriber",fontsize=14)
ax = plt.subplot("212")
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="red")
ax.set_title("Customer",fontsize=14)
ax.set_ylabel("Mean Temperature")
plt.ylim(0,20)
plt.xlabel("Rental Volume")
plt.show()

In [44]: fig = plt.figure()
fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Temperature and the rider", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="blue"
ax.set_title("Subscriber",fontsize=14)
ax = plt.subplot("212")
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="red"
ax.set_title("Customer",fontsize=14)
plt.show()

Clearly there is a relationship between total ridership and the temperature. The relationship seems to be slightly exponential for Customers v/s Subscribers. Subscribers can be seen hiring bikes at much lower temperatures.

b) Now lets look at the precipitation in inches and how that affects the ridership

In [45]: def fun_sum(x):
    return pd.Series({"TotalRidership":np.sum(x.total_rides)})

In [46]: fig = plt.figure()
fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Precipitation and the rider volume", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="blue")
ax.set_title("Subscriber",fontsize=14)
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="red")
ax.set_title("Customer",fontsize=14)
plt.show()

As can be seen, precipitation results in a drastic drop in ridership.

c) How does wind speed affect the total rider volume?

In [47]: fig = plt.figure()
fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Mean Wind Speed (mph) and the rider volume", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="blue")
ax.set_title("Subscriber",fontsize=14)
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="red")
ax.set_title("Customer",fontsize=14)
plt.show()

As we see from the above graph, the rider volume is affected by the wind speed, however there are multiple sections in this graph. We see that the rider volume increases between 0 and 7 mph, however there is a sudden dip at 8 mph. This could probably be attributed to fewer days with 8 mph wind speeds, and hence a lower total ridership volume. We notice that right after 9 mph the total rider volume starts a steady decline.

d) Lets look at day of the week and how that affects ridership

In [48]: df_divvy_group["day_of_year"] = df_divvy_group.date.dt.dayofyear
df_divvy_group["day_of_week_mon_is_0"] = df_divvy_group.date.dt.dayofweek

In [49]: fig = plt.figure()
fig.set_figheight(9)
fig.set_figwidth(15)
ax = plt.subplot("211")
fig.suptitle("Day of week and the rider volume", fontsize=16)
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="blue")
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="red")
plt.show()

We can see from the above graphs that there is a difference between the Customer and Subscriber rider characteristics. Customers ride more on weekends, and subscribers ride more on weekdays.

An idea to explore: the difference between weekend v/s weekday

In [50]: df_divvy_group["IsWeekend"] = (df_divvy_group.day_of_week_mon_is_0>4).astyp
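The weekday and weekend features above rest on pandas numbering Monday as 0 and Sunday as 6, so values greater than 4 are Saturday and Sunday. A minimal sketch on four consecutive dates (Friday through Monday) makes the encoding concrete; the frame name `days` is illustrative.

```python
import pandas as pd

# Friday, Saturday, Sunday, Monday.
days = pd.DataFrame({
    "date": pd.to_datetime(["2015-06-12", "2015-06-13", "2015-06-14", "2015-06-15"])
})

# Monday is 0 and Sunday is 6.
days["day_of_week_mon_is_0"] = days["date"].dt.dayofweek

# Saturday (5) and Sunday (6) are flagged as weekend days.
days["IsWeekend"] = (days["day_of_week_mon_is_0"] > 4).astype(int)

print(days)
```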

In [248]: fig = plt.figure()
fig.set_figheight(9)
fig.set_figwidth(15)
ax = plt.subplot("211")
fig.suptitle("Is Weekend? and the rider volume", fontsize=16)
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_=df_temp.groupby("IsWeekend").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.IsWeekend,df_.TotalRidership,color="blue")
ax = plt.subplot("212")
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_=df_temp.groupby("IsWeekend").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.IsWeekend,df_.TotalRidership,color="red")
plt.show()

In [52]: df_divvy_group.to_csv('../../data/Divvy_Trips_2015-Q1Q2/data-weather-distan

Model Building
We are going to build a few different models with a different selection of features for each group of models.

1. Models being built-

a) Lasso Regression
b) Ridge Regression
c) Gradient Boosted Regressor
d) Elastic Net
2. Train/Test: 70/30
3. Feature scaling: Enabled
4. Grid Search CV: 10 Fold CV
5. Separate models for Customer and Subscriber user types
6. Models for feature sets-
a) All data except day of week
b) All data
c) Temperature, Precipitation, and Birth Year Diff From 1986
d) All from c) and dummy coded day of week feature

In [137]: import sklearn.cross_validation as cv
import sklearn.metrics as mt
import sklearn.linear_model as lm
import sklearn.ensemble as ensemble
import sklearn.preprocessing as ps
from sklearn.grid_search import RandomizedSearchCV
from sklearn.grid_search import GridSearchCV

Model Building Code
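The setup listed above (70/30 split, feature scaling, 10-fold grid search) can be sketched compactly for a single model. This is a minimal illustration on synthetic data, not the notebook's code. It assumes a recent scikit-learn, where `sklearn.cross_validation` and `sklearn.grid_search` have been replaced by `sklearn.model_selection`, and it places the scaler inside a pipeline so that scaling is fitted on the training folds only, which is a design choice of this sketch. The alpha grid is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: y depends on the first two of three features.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Scaling + Lasso in one estimator, tuned with 10-fold cross-validation.
pipe = make_pipeline(StandardScaler(), Lasso())
search = GridSearchCV(
    pipe,
    param_grid={"lasso__alpha": [0.001, 0.01, 0.1, 1.0]},  # illustrative grid
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out 30%.
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(search.best_params_, test_mse)
```

Swapping `Lasso()` for `Ridge()`, `ElasticNet()`, or `GradientBoostingRegressor()` (with a matching parameter grid) follows the same pattern.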

In [189]: def MSECalc(y, y_pred):
    return round(mt.mean_squared_error(y,y_pred),8)

def ModelScorer(pred_train, pred_test, y_train, y_test):
    mse_train = MSECalc(y_train, pred_train)
    mse_test = MSECalc(y_test, pred_test)
    return mse_train, mse_test

def ModelBuilder_lasso(X_train, y_train, X_test, config):
    model = GridSearchCV(lm.Lasso(),param_grid=config["params"]["lasso"
    model.fit(X_train,y_train)
    return model.predict(X_train), model.predict(X_test), model.best_params_

def ModelBuilder_ridge(X_train, y_train, X_test, config):
    model = GridSearchCV(lm.Ridge(),param_grid=config["params"]["ridge"
    model.fit(X_train,y_train)
    return model.predict(X_train), model.predict(X_test), model.best_params_

def ModelBuilder_gbr(X_train, y_train, X_test, config):
    model = GridSearchCV(ensemble.GradientBoostingRegressor(),param_grid
    model.fit(X_train,y_train)
    return model.predict(X_train), model.predict(X_test), model.best_params_

def ModelBuilder_en(X_train, y_train, X_test, config):
    model = GridSearchCV(lm.ElasticNet(),param_grid=config["params"]["en"
    model.fit(X_train,y_train)
    return model.predict(X_train), model.predict(X_test), model.best_params_

def ModelComparator(X,y,config):
    X=ps.scale(X)
    y=ps.scale(y)
    X_train, X_test, y_train, y_test = cv.train_test_split(X,y, test_size
    pred_train={}
    pred_test={}
    mse_train={}
    mse_test={}
    params={}
    for model in config["models"]:
        if "lasso" in model:
            pred_train[model], pred_test[model], params[model]=ModelBuilder
        if "ridge" in model:
            pred_train[model], pred_test[model], params[model]=ModelBuilder
        if "gbr" in model:
            pred_train[model], pred_test[model], params[model]=ModelBuilder
        if "en" in model:
            pred_train[model], pred_test[model], params[model]=ModelBuilder
        mse_train[model], mse_test[model] = ModelScorer(pred_train[model
    return mse_train, mse_test, params

Configuration to drive Model Building code

In [150]: config={"models":["lasso","ridge","en","gbr"],
        "params":{"lasso":{"alpha":[…], "tol":[…]},
                  "ridge":{"alpha":[…], "tol":[…]},
                  "en":{"tol":np.linspace(0.0001, 0.1, num=15),
                        "alpha":[…],
                        "l1_ratio":np.linspace(0.01, 1.0, num=15)},
                  "gbr":{"learning_rate":np.linspace(…, num=15),
                         "min_samples_split":range(1,…),
                         "min_samples_leaf":range(1,…)}},
        "cv":10}
mse_train={}
mse_test={}

Code for dummy coding of categoricals

In [94]: def dummy_coding(x, col_names):
    sep={}
    for col in col_names:
        vals=list(x[col].unique())
        for val in vals:
            # one 0/1 indicator column per distinct value
            sep["%s_%s"%(col,val)] = (x[col]==val).astype(int)
    return sep

Method that encapsulates running all models, as well as aggregating all scores and details about the run

In [226]: def run_models(user, features, X, y, config):
    train, test, param = ModelComparator(X, y, config)
    mse_test = []
    mse_train = []
    params = []
    models = []
    usertype = []
    feature_set = []
    scores_df = pd.DataFrame()
    for model in config["models"]:
        models.append(model)
        mse_train.append(train[model])
        mse_test.append(test[model])
        params.append(param[model])
        usertype.append(user)
        feature_set.append(features)
    scores_df["feature_set"] = feature_set
    scores_df["usertype"] = usertype
    scores_df["model"] = models
    scores_df["mse_train"] = mse_train
    scores_df["mse_test"] = mse_test
    scores_df["rmse_train"] = np.sqrt(scores_df.mse_train)
    scores_df["rmse_test"] = np.sqrt(scores_df.mse_test)
    scores_df["params"] = params
    return scores_df

Variable to catch all scores

In [227]: scores = []

Models built with different feature sets-
a) All data except day of week

In [228]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X.drop(["usertype","date","CDT","Events","day_of_week_mon_is_0"],axis=1,inplace=True)
X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"…
X.dropna(inplace=True)
X_cust.dropna(inplace=True)
y=X.total_rides
y_cust=X_cust.total_rides
X.drop("total_rides",axis=1,inplace=True)
X_cust.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","all_except_dow",X,y,config))
scores.append(run_models("customer","all_except_dow",X_cust,y_cust,config))

b) All data

In [230]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X.drop(["usertype","date","CDT","Events"],axis=1,inplace=True)
X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"…
X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],axis=1)
X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon_is_0"]))],axis=1)
X.dropna(inplace=True)
X_cust.dropna(inplace=True)
y=X.total_rides
y_cust=X_cust.total_rides
X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","all_features",X,y,config))
scores.append(run_models("customer","all_features",X_cust,y_cust,config))

c) Weather Data Only

In [231]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_86"]]
# as written, this selects from X (the Subscriber rows), not from the Customer rows above
X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn"]]
pd.tools.plotting.scatter_matrix(X,figsize=(15,10))
plt.show()

In [232]: X.dropna(inplace=True)
X_cust.dropna(inplace=True)
y=X.total_rides
y_cust=X_cust.total_rides
X.drop("total_rides",axis=1,inplace=True)
X_cust.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth",X,y,config))
scores.append(run_models("customer","temp_prec_birth",X_cust,y_cust,config))

d) Weather data and day of week
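For this feature set the day of week is dummy coded with the `dummy_coding` helper defined earlier. The sketch below repeats that helper on a small made-up frame (the values are illustrative only) and compares the result with pandas' built-in `get_dummies`.

```python
import pandas as pd

def dummy_coding(x, col_names):
    # One 0/1 indicator column per distinct value of each listed column.
    sep = {}
    for col in col_names:
        for val in x[col].unique():
            sep["%s_%s" % (col, val)] = (x[col] == val).astype(int)
    return sep

# Toy frame; the numbers are invented for illustration.
toy = pd.DataFrame({"day_of_week_mon_is_0": [0, 1, 5, 0],
                    "total_rides": [120, 135, 310, 98]})

dummies = pd.DataFrame(dummy_coding(toy, ["day_of_week_mon_is_0"]))
builtin = pd.get_dummies(toy["day_of_week_mon_is_0"],
                         prefix="day_of_week_mon_is_0").astype(int)
print(dummies)
```

Every distinct value gets its own column, so no reference level is dropped and each row sums to one across the indicator columns.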

In [234]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_86","day_of_week_mon_is_0"]]
# as written, this selects from X (the Subscriber rows), not from the Customer rows above
X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn","day_of_week_mon_is_0"]]
X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],axis=1)
X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon_is_0"]))],axis=1)
pd.tools.plotting.scatter_matrix(X,figsize=(15,10))
plt.show()

In [235]: X.dropna(inplace=True)
X_cust.dropna(inplace=True)
y=X.total_rides
y_cust=X_cust.total_rides
X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth_dow",X,y,config))
scores.append(run_models("customer","temp_prec_birth_dow",X_cust,y_cust,config))

In [236]: scores_df = pd.concat(scores)
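A small sketch of what concatenating and ranking the per-run score frames looks like. The two frames and their scores are made up. `DataFrame.sort`, used in the next cell, was removed from later pandas releases, and `sort_values` is the equivalent call there.

```python
import numpy as np
import pandas as pd

# Two per-run frames with invented scores, standing in for run_models output.
run_a = pd.DataFrame({"usertype": "subscriber", "model": ["lasso", "ridge"],
                      "mse_test": [0.04, 0.01]})
run_b = pd.DataFrame({"usertype": "customer", "model": ["lasso", "ridge"],
                      "mse_test": [0.25, 0.16]})

scores_df = pd.concat([run_a, run_b])        # row labels repeat: 0, 1, 0, 1
scores_df["rmse_test"] = np.sqrt(scores_df.mse_test)
ranked = scores_df.sort_values("mse_test")   # ascending, best model first
print(ranked)
```

Because each run numbers its own rows from zero, the concatenated frame repeats row labels, which is why the index in the sorted output only identifies the model's position within its run.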

In [240]: scores_df.sort("mse_test")
Out[240]:
   feature_set          usertype    model  mse_train  mse_test      rmse_train  rmse_test
1  all_except_dow       subscriber  ridge  …          2.600000e-07  …           0.000510
…
2  temp_prec_birth_dow  customer    en     …          1.317287e-01  …           0.362944
…

Analysis of results
1. For user type: Subscriber
a) It's interesting to see that the subscriber model that performed the best (and best overall, compared to the customer models as well) was the one with the entire feature set (except day_of_week_mon_is_0)-

In [188]: list(df_divvy_group.columns)
Out[188]: ['usertype', 'date', 'total_rides', 'avg_trip_duration_s', 'avg_distance_m', 'male', 'female', 'birth_year_diff_86', 'day_of_week_mon_is_0', 'day_of_year', 'CDT', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF', 'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity', 'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn', 'Mean Sea Level PressureIn', 'Min Sea Level PressureIn', 'Max VisibilityMiles', 'Mean VisibilityMiles', 'Min VisibilityMiles', 'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH', 'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees', 'IsWeekend']
b) The best performing model was- Ridge Regression, MSE: 2.600000e-07, RMSE: 0.000510
c) Tuned Parameters- alpha: 0.001, tol: 0.0001
d) Another interesting fact to note is that the top performing models for the Subscriber user were all linear models.
e) Models created with only the weather data performed on the lower end of the spectrum for Subscribers. This shows that Subscribers are less influenced by changes in weather conditions when it comes to renting Divvy bikes.
2. For the user type: Customer

a) The best performing model for Customer was trained with only weather data and day of week. Note: The scores list the best customer model with feature set inclusive of birth year. This is not true for the model, and is only a labeling issue.
b) The best performing model was- Elastic Net, MSE: 1.317287e-01, RMSE: 0.362944
c) Tuned Parameters-

In [249]: scores_df[(scores_df.feature_set=="temp_prec_birth_dow") & (scores_df.usert…
Out[249]: {'alpha': 0.1, 'l1_ratio': 0.080714285714285711, 'tol': 0.078592857142857145}

d) In the case of models for Customers it can be noted that the top performing models are all linear models.
e) Customers are people who rent only for the day, and hence decide on the fly about renting a bike. These rental decisions can be affected by weather conditions, as customers could be visitors etc., who are not prepared for the weather. Subscribers on the other hand are using bikes more for commuting to work, as can be seen from total rider volume by user type for a given day of the week (EDA section).
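As a quick consistency check on the figures quoted in the analysis above, the RMSE values should be the square roots of the corresponding MSE values, up to the rounding of the printed numbers. A minimal sketch (the dictionary labels are only for display):

```python
import math

# (MSE, RMSE) pairs as quoted in the analysis above.
reported = {
    "subscriber, ridge": (2.600000e-07, 0.000510),
    "customer, elastic net": (1.317287e-01, 0.362944),
}

for label, (mse, rmse) in reported.items():
    # RMSE is the square root of MSE.
    print(label, round(math.sqrt(mse), 6), rmse)
```

Since ModelComparator scales the target before fitting, these errors are in units of standard deviations of total_rides, not in rides.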