Case Study-Cyclistic Bike Sharing Company
Cyclistic
Bike-sharing Company
The primary data source for this project is Cyclistic's historical
bicycle trip data, which can be downloaded from
https://divvy-tripdata.s3.amazonaws.com/index.html. The dataset
covers bike trips taken over the past 12 months and includes fields
such as:
• Ride ID
• Rideable type
• Trip start time (date and time in one cell)
• Trip end time (date and time in one cell)
• Trip start station name
• Start station ID
• Trip end station name
• End station ID
• Start Latitude
• Start Longitude
• End Latitude
• End Longitude
Data Cleaning, Manipulation & Aggregation
The bike trip data for the past 12 months was consolidated into a
single table with UNION ALL to simplify cleaning and analysis.
Duplicate rows were checked for with SELECT DISTINCT * and a
GROUP BY count; none were found. The data was checked for
misspellings and null values using COUNT, GROUP BY,
HAVING COUNT(*) > 1, and LENGTH. New variables were created for
the trip length, the start-to-end station names, and the day of the
week as a number. A new table was then created without NULL
values, because many observations were missing station names or
IDs and some had negative trip durations, which made analysis
difficult. About 1.32 million of the 5.75 million observations
(5,754,248 in total, 4,437,463 kept) were eliminated because values
were out of range or essential information was missing.
Data Analysis
The busiest station is Streeter Dr & Grand Ave, with 71,534 rides.
Casual riders ride most often to this station, as well as to
DuSable Lake Shore Dr & Monroe St and Millennium Park. Annual
members ride most often to Kingsbury St & Kinzie St, Clark St
& Elm St, and Wells St & Concord Ln.
The classic bike is the most popular type of bike, with 2.6 million
rides, followed by electric bikes with 1.6 million rides and docked
bikes with 175,597 rides. On average, classic bikes are used for
16 minutes and 30 seconds per ride, electric bikes for almost 13
minutes, and docked bikes for about 49 minutes and 30 seconds.
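The ride counts and average durations per bike type could be reproduced with an aggregation along the following lines. This is a minimal sketch, not the query actually used: it assumes the cleaned table cyclistic_trips_all_clean and the ride_length column (in minutes) described in the appendix.

```sql
-- Sketch only: table and column names are taken from the appendix;
-- the output aliases total_rides and avg_ride_length_minutes are illustrative.
SELECT
  rideable_type,
  COUNT(*) AS total_rides,
  ROUND(AVG(ride_length), 2) AS avg_ride_length_minutes
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean`
GROUP BY rideable_type
ORDER BY total_rides DESC;
```

Because ride_length is stored in whole minutes, an average such as 16.5 corresponds to 16 minutes and 30 seconds.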
Key Points:
We used a clean sample of 4,437,463 rides for the analysis, of
which 1,775,175 were taken by casual riders and 2,662,288 by annual
members. Some key points in the analysis include:
• The busiest days for Cyclistic are Sundays, with 709,611 rides,
and Fridays, with 654,383 rides, while Tuesday is the least busy
day with 595,988 rides.
• The busiest stations are Streeter Dr & Grand Ave with 71,534
rides, DuSable Lake Shore Dr & Monroe St with 39,370 rides,
DuSable Lake Shore Dr & North Blvd with 37,789 rides,
Michigan Ave & Oak St with 37,370 rides, and Wells St &
Concord Ln with 34,877 rides.
• The most used bike for rides is the classic bike, with 2,632,924
rides, followed by electric bikes, with 1,628,942 rides, and
docked bikes, with 175,597 rides.
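A hedged sketch of how the day-of-week and station counts in the key points could be derived. It assumes the cleaned table and the day_of_week column described in the appendix, and it assumes that a station's ride count is based on start_station_name; the original queries are not shown.

```sql
-- Rides per day of the week, split by rider type (sketch, assumed columns).
SELECT
  day_of_week,
  member_casual,
  COUNT(*) AS total_rides
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean`
GROUP BY day_of_week, member_casual
ORDER BY total_rides DESC;

-- Five busiest stations; counting by start station is an assumption.
SELECT
  start_station_name,
  COUNT(*) AS total_rides
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean`
GROUP BY start_station_name
ORDER BY total_rides DESC
LIMIT 5;
```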
Based on the key points, here are the top three recommendations
for Cyclistic to design a new marketing strategy to convert casual
riders into annual members:
Target busiest days and stations: Since Sundays and Mondays are
the days when casual riders use the bikes the most, Cyclistic could
target these days with a special promotion or discount for annual
membership sign-ups. Similarly, since the busiest stations are used
most by casual riders, Cyclistic could target these stations with
promotional offers that encourage casual riders to upgrade to an
annual membership.
Overall, the focus should be on providing attractive incentives for
casual riders to become annual members, while also highlighting the
benefits of annual membership and targeting the busiest days and
stations with promotional offers.
Visualizations of Key Insights
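[Chart: number of rides by rider type. Annual members: 2,662,288; casual riders: 1,775,175]
[Chart: average ride length in minutes by rider type. Casual riders: 23.50; annual members: 12.00]
Note: on average, casual riders' rides are almost twice as long as annual members' rides. This suggests that the
right offer might convert them to annual members.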
[Chart: casual rides by day of the week. Values shown: 304,432; 250,702; 231,915; 213,359; 206,447; 199,916]
Casual rides take place mostly on Sundays and Mondays, which makes those the best
days to reach casual riders.
[Chart: top start stations for casual riders. Ride counts shown: 55,231; 30,374; 24,093; 23,820; 22,196; 19,640; 17,354; 14,952; 13,303; 12,838.
Station labels shown, some truncated: Streeter Dr & Grand Ave; DuSable Lake Shore Dr &… (twice); Michigan Ave & Oak St; Wells St & Concord Ln;
Shedd Aquarium; Theater on the Lake; Dusable Harbor; Roosevelt Rd; Park]
Casual riders start their rides most often from Streeter Dr & Grand Ave. This station would
be the best place to approach them.
[Chart: top stations for annual members. Stations shown: Kingsbury St & Kinzie St; Wells St & Concord Ln; Clinton St & Washington Blvd;
Loomis St & Lexington St; University Ave & 57th St; Clinton St & Madison St; Broadway & Barry Ave.
Ride counts shown: 19,925; 19,591; 18,668; 18,665; 18,276; 16,380]
APPENDIX – BIG QUERY SQL DOCUMENTATION
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202301`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202212`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202211`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202210`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202209`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202208`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202207`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202206`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202205`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202204`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202203`
UNION ALL
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_202202`;
2ND STEP – CHECK FOR DUPLICATES
--By running this query, you will get a table with all the columns, and COUNT(*). Each row corresponds to a unique combination
--of values, and the COUNT(*) column shows the number of occurrences of that combination.
--If the value in the COUNT(*) column is greater than 1, then there are duplicate rows in the table. You can then take
--appropriate action to remove the duplicates, such as deleting them or merging them into a single row.
--Note that in this example, all columns are included in the grouping, which means that the query will only identify exact
--duplicates, where all columns have the same values. If you want to identify duplicates based on specific columns, you can
--modify the GROUP BY clause accordingly. For this specific case there aren’t any duplicates.
SELECT *, COUNT(*)
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
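The query above appears to be cut off before its GROUP BY and HAVING clauses. A minimal sketch of a complete exact-duplicate check follows. The column list is an assumption based on the fields listed at the start of this document; in particular, the names start_lat, start_lng, end_lat and end_lng do not appear elsewhere in this appendix.

```sql
-- Sketch only: groups on every (assumed) column and keeps combinations
-- that occur more than once, i.e. exact duplicate rows.
SELECT
  ride_id, rideable_type, started_at, ended_at,
  start_station_name, start_station_id,
  end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng,
  member_casual,
  COUNT(*) AS occurrences
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY
  ride_id, rideable_type, started_at, ended_at,
  start_station_name, start_station_id,
  end_station_name, end_station_id,
  start_lat, start_lng, end_lat, end_lng,
  member_casual
HAVING COUNT(*) > 1;
```

An empty result would match the statement that this table contains no duplicates.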
3RD STEP – CHECK STRING COLUMNS FOR MISSPELLINGS AND NULL VALUES
--We can check for misspellings by using GROUP BY and COUNT on each string column. The grouped output also exposes values
--with extra spaces or characters. The following queries group and count the values in each string column:
SELECT rideable_type, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY rideable_type;
SELECT start_station_name, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY start_station_name;
--There aren’t any misspellings or cells with extra characters, but 843,525 of 5,754,248 cells are NULL.
SELECT start_station_id, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY start_station_id;
--There aren’t any misspellings or cells with extra characters, but 843,525 of 5,754,248 cells are NULL.
--With neither a start station ID nor a start station name, the missing values cannot be recovered.
SELECT end_station_name, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY end_station_name;
--There aren’t any misspellings or cells with extra characters, but 902,655 of 5,754,248 cells are NULL.
SELECT end_station_id, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY end_station_id;
--There aren’t any misspellings or cells with extra characters, but 902,655 of 5,754,248 cells are NULL.
--With neither an end station ID nor an end station name, the missing values cannot be recovered.
SELECT member_casual, COUNT(*) AS frequency
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
GROUP BY member_casual;
--There aren’t any misspellings or cells with extra characters. The member_casual column contains 2,343,520 casual rides
--and 3,410,728 member rides.
SELECT
ride_id
FROM
`fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`
WHERE
--The new table has four additional columns as compared to the original table, created using the SELECT statement:
--trip_station_names: This column is created using the CONCAT function to concatenate the values in the start_station_name
--and end_station_name columns. The resulting string is separated by a hyphen (-).
--ride_length: This column is created using the TIMESTAMP_DIFF function. It calculates the time difference between the
--started_at and ended_at columns in minutes and returns the result as an integer.
--day_of_week: This column is created using the EXTRACT function to extract the day of the week from the started_at column.
--The DAYOFWEEK part returns the day of the week as a number between 1 and 7. In BigQuery, Sunday is 1 and Saturday is 7.
--The * in the SELECT statement indicates that all the columns from the original cyclistic_trips_all table should also be included
--in the new table.
--Overall, this SQL statement creates a new table with additional columns that provide more insight into the
--cyclistic_trips_all data. The new columns support analyses that were not possible with the original table.
SELECT
*,
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`;
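The statement above is truncated after the asterisk. Based on the column descriptions, the full statement might look like the sketch below, which covers only the three columns that are described. The target table name cyclistic_trips_all_V2 is taken from the next step; the ' - ' separator, the TIMESTAMP type of started_at and ended_at, and the exact expressions are assumptions.

```sql
-- Sketch only: reconstructs the three described columns.
CREATE TABLE `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_V2` AS
SELECT
  *,
  -- Start and end station names joined by a hyphen (NULL if either is NULL).
  CONCAT(start_station_name, ' - ', end_station_name) AS trip_station_names,
  -- Whole minutes between start and end; negative if ended_at precedes started_at.
  TIMESTAMP_DIFF(ended_at, started_at, MINUTE) AS ride_length,
  -- BigQuery returns 1 for Sunday through 7 for Saturday.
  EXTRACT(DAYOFWEEK FROM started_at) AS day_of_week
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all`;
```

If started_at and ended_at were loaded as DATETIME rather than TIMESTAMP, DATETIME_DIFF would be the matching function.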
6TH STEP – WE CREATE A NEW TABLE BUT WITHOUT NULL VALUES FOR START STATION AND
END STATION.
--This query creates a new table called cyclistic_trips_all_V3 in the cyclistic_trips dataset of the fourth-blend-376817
--project. The data for this new table is derived from an existing table called cyclistic_trips_all_V2, which is in
--the same dataset.
--The SELECT statement selects all columns from the cyclistic_trips_all_V2 table without modifying them. The WHERE
--clause removes any row in which start_station_name, start_station_id, end_station_name, or end_station_id
--is NULL. This means that the resulting table only contains rows where all four of these columns have a
--non-NULL value.
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_V2`
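The CREATE TABLE line and the WHERE clause of this query are missing above. A minimal sketch consistent with the description, assuming the statement creates cyclistic_trips_all_V3 and keeps only rows in which all four station columns are populated:

```sql
-- Sketch only: the filter is inferred from the step description.
CREATE TABLE `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_V3` AS
SELECT *
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_V2`
WHERE start_station_name IS NOT NULL
  AND start_station_id IS NOT NULL
  AND end_station_name IS NOT NULL
  AND end_station_id IS NOT NULL;
```

Using AND between the conditions drops a row as soon as any one of the four values is missing, which matches the goal of a table without NULL station values.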
--COUNT(*) is an aggregate function that counts the rows in the table that meet the condition in the WHERE clause.
--In this case, the condition is that ride_length is less than zero, which indicates an error in the data,
--because a ride cannot have a negative duration.
SELECT COUNT(*)
FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean`
WHERE ride_length < 0;
--We find that 53 rows have values below zero. The rows where ride_length is less than zero are then deleted from the
--cyclistic_trips_all_clean table.
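The DELETE statement itself is not shown above. A minimal sketch, assuming the table and column named in the comment:

```sql
-- Sketch only: removes rides with a negative duration.
-- BigQuery requires a WHERE clause on every DELETE statement.
DELETE FROM `fourth-blend-376817.cyclistic_trips.cyclistic_trips_all_clean`
WHERE ride_length < 0;
```

Re-running the COUNT(*) check afterwards should return zero if the deletion succeeded.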