
Case Study

Cyclistic
Bike-sharing Company

by Ramberto Jr. Sosa


Table of Contents
Case Study Business Task
Data Sources & Description
Data Cleaning, Manipulation & Aggregation
Data Analysis
Recommendations Based on the Analysis
Visualizations of Key Insights
APPENDIX – BIG QUERY SQL DOCUMENTATION
Case Study Business Task

The goal of the business task is to create a new marketing strategy
that will turn occasional riders into annual subscribers for Cyclistic,
a hypothetical company. The company's director of marketing
believes that increasing the number of yearly subscriptions is crucial
for the long-term success of the company. To achieve this goal, my
marketing analyst team, also hypothetical, must understand why
casual riders would want to become members, and how annual
members use Cyclistic bikes differently from occasional riders.

To guide the development of the new marketing strategy, we plan to
analyze past bike trip data from Cyclistic. Our aim is to identify
usage patterns that support compelling data visualizations and
well-founded recommendations. These visualizations and
recommendations will be used to convince Cyclistic's management to
approve the new marketing strategy.
Data Sources & Description

The primary data source for this project is historical bicycle trip data
from Cyclistic, which can be downloaded from
https://divvy-tripdata.s3.amazonaws.com/index.html. The dataset
covers bike trips taken in the past 12 months (February 2022 through
January 2023, according to the table names in the appendix) and
includes fields such as:

• Ride ID
• Rideable type
• Trip start time (date and time in one cell)
• Trip end time (date and time in one cell)
• Trip start station name
• Start station ID
• Trip end station name
• End station ID
• Start Latitude
• Start Longitude
• End Latitude
• End Longitude
Data Cleaning, Manipulation & Aggregation

For data cleaning, manipulation, and aggregation, we performed the
following steps in SQL:

The bike trip data for the past 12 months was consolidated into a
single table by stacking the monthly tables with UNION ALL, which
simplifies the cleaning and analysis process. The table was checked
for exact duplicate rows using GROUP BY with HAVING COUNT(*) > 1, and
none were found. The string columns were checked for misspellings and
null values using COUNT and GROUP BY, and the ride IDs were checked
using LENGTH. New variables were created for the trip length, the
start and end station names combined, and the day of the week as a
number. A new table was then created with no NULL values, because
many observations were missing station names or IDs, and rows with a
negative trip length were deleted; both problems made analysis
difficult. About 1.32 million rows out of 5.75 million observations
were eliminated because values were out of range or essential
information was missing.
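
As a rough cross-check of these figures, the following BigQuery sketch counts in a single pass how many rows each cleaning rule would remove. It assumes the table cyclistic_trips_all_V2 and the column names from the appendix; the output column names are illustrative only.

```sql
-- Sketch: count how many rows each cleaning rule would remove.
-- Table and column names come from the appendix; the aliases are illustrative.
SELECT
  COUNT(*) AS total_rows,
  -- Rows missing at least one of the four station fields.
  COUNTIF(
    start_station_name IS NULL OR start_station_id IS NULL
    OR end_station_name IS NULL OR end_station_id IS NULL
  ) AS rows_missing_station,
  -- Rows with complete station fields but a negative ride length.
  COUNTIF(
    start_station_name IS NOT NULL AND start_station_id IS NOT NULL
    AND end_station_name IS NOT NULL AND end_station_id IS NOT NULL
    AND ride_length < 0
  ) AS rows_negative_length
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_V2`;
```

Subtracting the two counts from total_rows should give the size of the cleaned sample. With the figures reported in the appendix (5,754,248 rows in total, 4,437,516 left after removing rows with missing stations, and 53 negative ride lengths), that is 4,437,516 - 53 = 4,437,463 rows.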
Data Analysis

The analysis is based on a cleaned sample of about 4.4 million
observations, with about 1.78 million rides by casual riders and 2.66
million rides by annual members. The average ride length is 16
minutes and 30 seconds, with a maximum of 572 hours and 30 minutes.
Casual riders have a longer average ride length of 23 minutes and 30
seconds, while annual members have an average ride length of 12
minutes. The busiest days of the week for Cyclistic are Sundays and
Fridays, and the least busy day is Tuesday.

The busiest station is Streeter Dr & Grand Ave, with 71,534 rides.
Casual riders use this station the most, followed by DuSable Lake
Shore Dr & Monroe St and Millennium Park. Annual members most often
use Kingsbury St & Kinzie St, Clark St & Elm St, and Wells St &
Concord Ln.

The classic bike is the most popular type of bike, with 2.6 million
rides, followed by electric bikes with 1.6 million rides and docked
bikes with 175,597 rides. On average, classic bikes are used for 16
minutes and 30 seconds, electric bikes for almost 13 minutes, and
docked bikes for about 49 minutes and 30 seconds.
Key Points:
We used a clean sample of 4,437,463 observations for the analysis, of
which 1,775,175 were rides by casual riders and 2,662,288 were rides
by annual members. Some key points in the analysis include:

• The average ride length for all users is about 16 minutes and 30
seconds, with a maximum ride length of about 572 hours and
30 minutes.

• Casual riders have an average ride length of 23 minutes and 30
seconds, with a maximum ride length of about 572 hours and
30 minutes.

• Annual members have an average ride length of 12 minutes,
with a maximum ride length of about 25 hours.

• The busiest days for Cyclistic are Sundays, with 709,611 rides,
and Fridays, with 654,383 rides, while Tuesday is the least busy
day with 595,988 rides.

• Rides are longest on Sundays and Mondays, with an average of
20 minutes, and shortest on Wednesdays and Thursdays, with an
average of about 14 minutes.

• Casual riders are most active on Sundays, with a total of 368,404
rides, and Mondays, with a total of 304,432 rides. The day with
the least activity for casual riders is Wednesday, with 199,916
rides and an average ride length of about 21 minutes.

• Annual members are most active on Wednesdays, with a
total of 424,050 rides, Fridays, with a total of 422,468 rides, and
Thursdays, with a total of 422,212 rides. They have an average
ride length of about 11 minutes in general. The days with the
least activity for annual members are Sunday, with 341,207 rides,
and Monday, with 302,990 rides, with an average ride length of
about 13 minutes.

• The busiest stations are Streeter Dr & Grand Ave with 71,534
rides, DuSable Lake Shore Dr & Monroe St with 39,370 rides,
DuSable Lake Shore Dr & North Blvd with 37,789 rides,
Michigan Ave & Oak St with 37,370 rides, and Wells St &
Concord Ln with 34,877 rides.

• Casual riders use the following stations the most: Streeter Dr
& Grand Ave, 55,231 rides; DuSable Lake Shore Dr & Monroe St,
30,374 rides; Millennium Park, 24,093 rides; Michigan Ave & Oak
St, 23,820 rides.

• Annual members use the following stations the most:
Kingsbury St & Kinzie St, 23,774 rides; Clark St & Elm St, 20,983
rides; Wells St & Concord Ln, 19,925 rides; Clinton St &
Washington Blvd, 19,591 rides.

• The most used bike for rides is the classic bike, with 2,632,924
rides, followed by electric bikes, with 1,628,942 rides, and
docked bikes, with 175,597 rides.

• On average, classic bikes are ridden for 16 minutes and 30
seconds, electric bikes for almost 13 minutes, and docked bikes
for about 49 minutes and 30 seconds.

• Casual riders took 895,732 rides on classic bikes in the sample,
with an average length of almost 24 minutes. They also took
703,846 rides on electric bikes, with an average length of about
16 minutes. Docked bikes were ridden by casual riders 175,597
times, with an average length of 49 minutes and 30 seconds.

• Annual members rode classic bikes for an average of 12
minutes and 40 seconds, and electric bikes for 10 minutes.
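
The per-day figures in the key points could be reproduced with a query along the following lines. This is a minimal sketch, not the exact query used for the report: it assumes the cleaned table and the day_of_week and ride_length columns described in the appendix, and the output column names are illustrative. Because ride_length is stored in whole minutes, the averages are approximate.

```sql
-- Sketch: ride counts and ride lengths by rider type and day of the week.
-- Assumes the cleaned table and derived columns described in the appendix.
SELECT
  member_casual,
  day_of_week,
  COUNT(*) AS total_rides,
  ROUND(AVG(ride_length), 1) AS avg_ride_length_minutes,
  MAX(ride_length) AS max_ride_length_minutes
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean`
GROUP BY member_casual, day_of_week
ORDER BY member_casual, total_rides DESC;
```

Replacing day_of_week with rideable_type or start_station_name in both the SELECT list and the GROUP BY clause gives the corresponding breakdown by bike type or start station.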
Recommendations Based on the Analysis

Based on the key points, here are the top three recommendations
for Cyclistic to design a new marketing strategy to convert casual
riders into annual members:

Offer discounts: Provide discounts to encourage casual riders to
become annual members. Since casual riders use the bikes for longer
periods, offering them a discount on the annual membership could
be a cost-effective way to incentivize them to switch to an annual
membership.

Promote benefits of annual membership: Highlight the benefits of
annual membership, such as cheaper fares, to persuade casual riders
to become annual members. Cyclistic could use various marketing
channels, such as email campaigns and social media, to educate
casual riders about the benefits of annual membership and why it
makes sense to switch.

Target busiest days and stations: Since Sundays and Mondays are
the days when casual riders use the bikes the most, Cyclistic could
target these days with a special promotion or discount for annual
membership sign-ups. Similarly, since the busiest stations are used
most by casual riders, Cyclistic could target these stations with
promotional offers that encourage casual riders to upgrade to an
annual membership.

Overall, the focus should be on providing attractive incentives for
casual riders to become annual members, while also highlighting the
benefits of annual membership and targeting the busiest days and
stations with promotional offers.
Visualizations of Key Insights

Casual Riders & Members: Sample

Casual Riders: 1,775,175
Annual Members: 2,662,288

Average Ride Length (minutes): Casual vs Member

Casual Riders: 23.50
Annual Members: 12.00

Note that the average ride length of casual riders is almost double that of annual members. This
means the right offer might just convert them to annual members.

Total Number of Rides per Day of the Week: Casual Riders

Sunday: 368,404
Monday: 304,432
Saturday: 250,702
Friday: 231,915
Tuesday: 213,359
Thursday: 206,447
Wednesday: 199,916

Casual rides take place mostly on Sundays and Mondays. This means those are the perfect
days to reach casual riders.

Total Number of Rides per Day of the Week: Annual Members

Wednesday: 424,050
Friday: 422,468
Thursday: 422,212
Tuesday: 382,629
Saturday: 366,732
Sunday: 341,207
Monday: 302,990

Annual members show a different pattern. They ride less on Sundays and Mondays,
contrary to casual riders.

Top 10 Start Stations by Rides: Casual Riders

Rides per station, in descending order: 55,231; 30,374; 24,093; 23,820; 22,196; 19,640; 17,354;
14,952; 13,303; 12,838.

Stations shown in the chart: Streeter Dr & Grand Ave; DuSable Lake Shore Dr &… (two stations);
Millennium Park; Michigan Ave & Oak St; Shedd Aquarium; Theater on the Lake; Wells St &
Concord Ln; DuSable Harbor; Indiana Ave & Roosevelt Rd.

Casual riders start their rides most often from Streeter Dr & Grand Ave. This station would
be the perfect place to approach them.
Top 10 Start Stations by Rides: Annual Members

Kingsbury St & Kinzie St: 23,774
Clark St & Elm St: 20,983
Wells St & Concord Ln: 19,925
Clinton St & Washington Blvd: 19,591
Loomis St & Lexington St: 18,668
University Ave & 57th St: 18,665
Ellis Ave & 60th St: 18,517
Clinton St & Madison St: 18,276
Wells St & Elm St: 17,892
Broadway & Barry Ave: 16,380

Annual members use different stations to start their rides.
APPENDIX – BIG QUERY SQL DOCUMENTATION

1ST STEP – JOIN ALL THE MONTHLY DATA


--Before cleaning the data, we combine all the available monthly Cyclistic tables into one table. This makes the job easier and
--less prone to mistakes. Because every monthly table has the same columns, the query below stacks their rows with UNION ALL.
--A JOIN is not used here: a JOIN matches rows across tables on a join condition, whereas UNION ALL appends rows.

CREATE TABLE `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all` AS

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202301`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202212`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202211`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202210`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202209`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202208`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202207`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202206`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202205`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202204`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202203`

UNION ALL

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202202`;
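
The twelve UNION ALL branches could also be written more compactly with a BigQuery wildcard table. The sketch below is an alternative to the statement above, not an addition to it (running both would fail because the table would already exist). It assumes that every monthly table has the same schema and that the only tables whose names start with cyclistic_trips_20 are the twelve monthly tables.

```sql
-- Sketch: build the combined table from a wildcard table instead of twelve UNION ALL branches.
-- Assumption: all tables matching the prefix cyclistic_trips_20 are monthly tables with the same schema.
CREATE TABLE `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all` AS
SELECT
  ride_id, rideable_type, started_at, ended_at,
  start_station_name, start_station_id, end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng, member_casual
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_20*`
-- _TABLE_SUFFIX holds the part of the table name matched by the wildcard, e.g. '2202' for 202202.
WHERE _TABLE_SUFFIX BETWEEN '2202' AND '2301';
```

The explicit UNION ALL version is longer, but it makes the list of source tables visible in the query itself, which can be easier to audit.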
2ND STEP – CHECK FOR DUPLICATES
--This query groups the rows by every column and returns each combination of values that occurs more than once, together with
--the number of occurrences of that combination in the COUNT(*) column.

--If the query returns any rows, then there are duplicate rows in the table. You can then take appropriate action to remove
--the duplicates, such as deleting them or merging them into a single row.

--Note that in this example, all columns are included in the grouping, which means that the query will only identify exact
--duplicates, where all columns have the same values. If you want to identify duplicates based on specific columns, you can
--modify the GROUP BY clause accordingly. In this specific case, there aren't any duplicates.

SELECT *, COUNT(*)
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name,
  end_station_id, start_lat, start_lng, end_lat, end_lng, member_casual
HAVING COUNT(*) > 1;
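
As an illustration of grouping on specific columns, the sketch below looks for ride IDs that appear more than once regardless of the other columns, and then shows one possible way to keep a single row per ride ID. Treating ride_id as a unique key is an assumption, and keeping the earliest started_at is an arbitrary illustrative choice.

```sql
-- Sketch: find ride IDs that appear more than once, ignoring the other columns.
SELECT ride_id, COUNT(*) AS occurrences
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY ride_id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;

-- Sketch: keep one row per ride_id (here, the one with the earliest start time).
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
WHERE TRUE  -- harmless filter; keeps the query valid where QUALIFY must accompany WHERE, GROUP BY, or HAVING
QUALIFY ROW_NUMBER() OVER (PARTITION BY ride_id ORDER BY started_at) = 1;
```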

3RD STEP – CHECK FOR MISSPELLINGS AND NULL VALUES IN STRING COLUMNS
--We can check for misspellings by using GROUP BY and COUNT on each string column. The same output also helps identify
--extra spaces and characters, and shows how many values are NULL. The following queries group by and count the frequency
--of the values in each string column:

SELECT rideable_type, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY rideable_type;

--There are no misspellings or cells with extra characters.

SELECT start_station_name, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY start_station_name;

--There are no misspellings or cells with extra characters, but 843,525 cells are null out of 5,754,248.

SELECT start_station_id, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY start_station_id;

--There are no misspellings or cells with extra characters, but 843,525 cells are null out of 5,754,248.

--With neither a start station ID nor a start station name, we can't fill in the missing values.

SELECT end_station_name, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY end_station_name;

--There are no misspellings or cells with extra characters, but 902,655 cells are null out of 5,754,248.

SELECT end_station_id, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY end_station_id;

--There are no misspellings or cells with extra characters, but 902,655 cells are null out of 5,754,248.

--With neither an end station ID nor an end station name, we can't fill in the missing values.

SELECT member_casual, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY member_casual;

--There are no misspellings or cells with extra characters. The member_casual column contains 2,343,520 casual riders and
--3,410,728 members.
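
The null counts and the check for extra spaces can also be gathered in one pass instead of being read off several grouped outputs. The following is a minimal sketch under the same table and column names; the output column names are illustrative.

```sql
-- Sketch: count NULLs and values with leading or trailing spaces in a single query.
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(start_station_name IS NULL) AS null_start_station_name,
  COUNTIF(start_station_id IS NULL) AS null_start_station_id,
  COUNTIF(end_station_name IS NULL) AS null_end_station_name,
  COUNTIF(end_station_id IS NULL) AS null_end_station_id,
  -- A value that differs from its trimmed form has leading or trailing whitespace.
  -- NULL values are not counted here, because the comparison evaluates to NULL.
  COUNTIF(start_station_name != TRIM(start_station_name)) AS start_names_with_extra_spaces,
  COUNTIF(end_station_name != TRIM(end_station_name)) AS end_names_with_extra_spaces
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`;
```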

4TH STEP – WE CHECK FOR MALFORMED RIDE IDs

--To check whether any ride IDs are malformed, the following query returns the ride IDs whose length is shorter than 15 or
--longer than 16 characters:

SELECT ride_id
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
WHERE LENGTH(ride_id) < 15 OR LENGTH(ride_id) > 16;
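
A complementary view is the full distribution of ride ID lengths, which shows at a glance whether any unexpected lengths exist and how many rows they affect. This is a minimal sketch using the same table; the alias names are illustrative.

```sql
-- Sketch: frequency of each ride_id length in the combined table.
SELECT LENGTH(ride_id) AS id_length, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY id_length
ORDER BY id_length;
```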


5TH STEP – WE CREATE NEW COLUMNS FOR THE ANALYSIS.
--This SQL statement creates a new table called cyclistic_trips_all_V2 in the cyclistic_trips dataset of the
--fourth-blend-376817 project. The data for this table is derived from the existing table cyclistic_trips_all, which is in the
--same dataset.

--The new table has three additional columns compared to the original table, created in the SELECT statement:

--trip_station_names: This column is created using the CONCAT function to concatenate the values in the start_station_name
--and end_station_name columns. The two names are separated by a hyphen surrounded by spaces (' - ').

--ride_length: This column is created using the TIMESTAMP_DIFF function. It calculates the time difference between the
--started_at and ended_at columns in whole minutes and returns the result as an integer.

--day_of_week: This column is created using the EXTRACT function to extract the day of the week from the started_at column.
--The DAYOFWEEK part specifies that the day of the week should be returned as a number between 1 and 7. In BigQuery,
--Sunday is 1 and Saturday is 7.

--The * in the SELECT statement indicates that all the columns from the original cyclistic_trips_all table should also be
--included in the new table.

--Overall, this SQL statement creates a new table with additional columns that make it possible to analyze trip length, trip
--routes, and weekday patterns, which the original table does not provide directly.

CREATE TABLE `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_V2` AS
SELECT
  *,
  CONCAT(start_station_name, ' - ', end_station_name) AS trip_station_names,
  TIMESTAMP_DIFF(ended_at, started_at, MINUTE) AS ride_length,
  EXTRACT(DAYOFWEEK FROM started_at) AS day_of_week
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`;
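
Because BigQuery numbers DAYOFWEEK from 1 (Sunday) to 7 (Saturday), it can help to confirm the mapping between the number and the weekday name before labeling any charts. The sketch below shows both side by side; FORMAT_TIMESTAMP with the '%A' format element returns the full weekday name, and the alias names are illustrative.

```sql
-- Sketch: show the weekday number next to the weekday name to confirm the mapping.
-- Both expressions use the default UTC time zone, so they stay consistent with each other.
SELECT
  EXTRACT(DAYOFWEEK FROM started_at) AS day_of_week,
  FORMAT_TIMESTAMP('%A', started_at) AS weekday_name,
  COUNT(*) AS total_rides
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY day_of_week, weekday_name
ORDER BY day_of_week;
```

If the result has exactly seven rows, each number maps to a single weekday name, and the names can be used directly as chart labels.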
6TH STEP – WE CREATE A NEW TABLE BUT WITHOUT NULL VALUES FOR START STATION AND
END STATION.
--This SQL statement creates a new table called cyclistic_trips_all_clean in the cyclistic_trips dataset of the
--fourth-blend-376817 project. The data for this new table is derived from the existing table cyclistic_trips_all_V2, which is
--in the same dataset.

--The SELECT statement selects all columns from the cyclistic_trips_all_V2 table without modifying them. The WHERE
--clause removes any row in which start_station_name, start_station_id, end_station_name, or end_station_id is NULL.
--This means that the resulting table only contains rows where all four of these columns have a non-NULL value.

CREATE TABLE `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean` AS
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_V2`
WHERE start_station_name IS NOT NULL
  AND start_station_id IS NOT NULL
  AND end_station_name IS NOT NULL
  AND end_station_id IS NOT NULL;

--This leaves 4,437,516 rows out of a total of 5,754,248.

7TH STEP – CHECK FOR OUT OF RANGE VALUES IN RIDE LENGTH.


--This query counts the rows in the cyclistic_trips_all_clean table where ride_length is less than zero.

--COUNT(*) is an aggregate function that counts the number of rows that meet the condition in the WHERE clause. In this
--case, the condition is that ride_length is less than zero, which indicates an error in the data, because a ride cannot
--have a negative duration.

SELECT COUNT(*)
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean`
WHERE ride_length < 0;

--We find that 53 rows have values below zero. The following statement deletes all the rows where ride_length is less than
--zero from the cyclistic_trips_all_clean table.

DELETE FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean` WHERE ride_length < 0;
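
After the delete, a short summary query can serve as a sanity check on the cleaned table. The sketch below is one possible check, with illustrative alias names: the minimum ride length should no longer be negative, and the ride counts per rider type should add up to the size of the cleaned sample.

```sql
-- Sketch: sanity check of the cleaned table by rider type.
SELECT
  member_casual,
  COUNT(*) AS total_rides,
  MIN(ride_length) AS min_ride_length_minutes,  -- should be 0 or greater after the delete
  ROUND(AVG(ride_length), 2) AS avg_ride_length_minutes,
  MAX(ride_length) AS max_ride_length_minutes
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean`
GROUP BY member_casual;
```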
