You are on page 1of 11

cohort-analysis

April 21, 2024

0.1 Cohort Analysis by Shromana Majumder


0.1.1 What is Cohort and Cohort Analysis?
A cohort is a collection of users who have something in common. A traditional cohort, for
example, divides people by the week or month of which they were first acquired. When referring
to non-time-dependent groupings, the term segment is often used instead of the cohort.
Cohort analysis is a descriptive analytics technique in cohort analysis. Customers are divided into
mutually exclusive cohorts, which are then tracked over time. It aids in the deeper interpretation
of high-level patterns by supplying metrics around the product and consumer lifecycle.
It is a method to group customers or users into cohorts based on shared characteristics or experiences
within a defined time-span. These cohorts are then tracked over time to observe changes in behavior,
usage, or other key metrics.

0.1.2 The major types of Cohort


• Time cohorts: customers who signed up for a product or service during a particular time
frame.
• Behavior cohorts: customers who purchased a product or subscribed to service in the past.
• Size cohorts: refer to the various sizes of customers who purchase the company’s products or
services.
• Funnel-Based cohorts : puts people into the individual groups according to a funnel’s stages.

0.1.3 Dataset
The dataset contains user interaction data, including metrics such as the number of new and
returning users, and their engagement durations on Day 1 and Day 7. Key columns in the dataset
are: * Date: The specific dates of user interactions. * New Users: The count of new users for
each date. * Returning Users: The count of users returning on each date. * Duration Day 1: The
average duration (possibly in minutes or seconds) of user interaction on their first day. * Duration
Day 7: The average duration of user interaction on their seventh day.

0.1.4 Task:
1. Identify trends in the acquisition of new users and the retention of returning users on a weekly
basis.
2. Understand how user engagement, as indicated by the average duration of interaction, evolves
from the first day to the seventh day of usage.

1
3. Detect any significant weekly patterns or anomalies in user behavior and engagement, and
investigate the potential causes behind these trends.
4. Explore the relationship between user retention (returning users) and engagement (duration
metrics), to assess the effectiveness of user engagement strategies.
5. Provide actionable insights that can guide marketing efforts, content strategies, and user
experience improvements.

0.1.5 Importing Libraries

[15]: import pandas as pd


import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt

0.1.6 Load Dataset


[16]: data = pd.read_csv("cohorts.csv")
data.head()

[16]: Date New users Returning users Duration Day 1 Duration Day 7
0 25/10/2023 3461 1437 202.156977 162.523809
1 26/10/2023 3777 1554 228.631944 258.147059
2 27/10/2023 3100 1288 227.185841 233.550000
3 28/10/2023 2293 978 261.079545 167.357143
4 29/10/2023 2678 1082 182.567568 304.350000

[17]: ### Checking missing values


missing_values = data.isnull().sum()
print(missing_values)

Date 0
New users 0
Returning users 0
Duration Day 1 0
Duration Day 7 0
dtype: int64

[18]: ## Checking data types


data_types = data.dtypes
print(data_types)

Date object
New users int64
Returning users int64
Duration Day 1 float64
Duration Day 7 float64
dtype: object

2
0.2 Data Preparation
The Date column is in object (string) format. For effective analysis, I would convert this to a
datetime format.
[19]: # Convert 'Date' column to datetime format
data['Date'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')

[20]: # Display the descriptive statistics of the dataset


descriptive_stats = data.describe()
descriptive_stats

[20]: Date New users Returning users Duration Day 1 \


count 30 30.000000 30.000000 30.000000
mean 2023-11-08 12:00:00 3418.166667 1352.866667 208.259594
min 2023-10-25 00:00:00 1929.000000 784.000000 59.047619
25% 2023-11-01 06:00:00 3069.000000 1131.500000 182.974287
50% 2023-11-08 12:00:00 3514.500000 1388.000000 206.356554
75% 2023-11-15 18:00:00 3829.500000 1543.750000 230.671046
max 2023-11-23 00:00:00 4790.000000 1766.000000 445.872340
std NaN 677.407486 246.793189 64.730830

Duration Day 7
count 30.000000
mean 136.037157
min 0.000000
25% 68.488971
50% 146.381667
75% 220.021875
max 304.350000
std 96.624319

• New Users: The average number of new users is around 3,418 with a standard deviation of
approximately 677. The minimum and maximum new users recorded are 1,929 and 4,790,
respectively.
• Returning Users: On average, there are about 1,353 returning users, with a standard deviation
of around 247. The minimum and maximum are 784 and 1,766, respectively.
• Duration Day 1: The average duration on the first day is about 208 seconds with a consider-
able spread (standard deviation is around 65).
• Duration Day 7: The average 7-day duration is lower, around 136 seconds, with a larger
standard deviation of about 97. The range is from 0 to 304

0.3 Trend of the new and returning users over time


[9]: # Plot
plt.figure(figsize=(10, 6))

# New Users

3
plt.plot(data['Date'], data['New users'], marker='o', linestyle='-', label='New␣
↪Users')

# Returning Users
plt.plot(data['Date'], data['Returning users'], marker='o', linestyle='-',␣
↪label='Returning Users')

# Add title and labels


plt.title('Trend of New and Returning Users Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Users')

# Rotate x-axis labels for better readability


plt.xticks(rotation=45)

# Add legend
plt.legend()

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()

4
0.4 Trend of duration over time
[10]: # Plot
plt.figure(figsize=(10, 6))

# Duration Day 1
plt.plot(data['Date'], data['Duration Day 1'], marker='o', linestyle='-',␣
↪label='Duration Day 1')

# Duration Day 7
plt.plot(data['Date'], data['Duration Day 7'], marker='o', linestyle='-',␣
↪label='Duration Day 7')

# Add title and labels


plt.title('Trend of Duration (Day 1 and Day 7) Over Time')
plt.xlabel('Date')
plt.ylabel('Duration')

# Rotate x-axis labels for better readability


plt.xticks(rotation=45)

# Add legend
plt.legend()

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()

5
0.5 Correlation
[11]: # Correlation matrix
correlation_matrix = data.corr()

# Plotting the correlation matrix


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Variables')
plt.show()

• The strongest correlation is between the number of new and returning users, indicating a
potential trend of new users converting to returning users.

6
0.6 Cohorts
Grouping the data by the week of the year to create cohorts. Then, for each cohort (week), I’ll
calculate the average number of new and returning users, as well as the average of Duration Day 1
and Duration Day 7.

0.7 Grouping the data by week and calculating the necessary metrices
[12]: # Grouping data by week
data['Week'] = data['Date'].dt.isocalendar().week

# Calculating weekly averages


weekly_averages = data.groupby('Week').agg({
'New users': 'mean',
'Returning users': 'mean',
'Duration Day 1': 'mean',
'Duration Day 7': 'mean'
}).reset_index()

weekly_averages.head()

[12]: Week New users Returning users Duration Day 1 Duration Day 7
0 43 3061.800000 1267.800000 220.324375 225.185602
1 44 3503.571429 1433.142857 189.088881 168.723200
2 45 3297.571429 1285.714286 198.426524 143.246721
3 46 3222.428571 1250.000000 248.123542 110.199609
4 47 4267.750000 1616.250000 174.173330 0.000000

0.8 Weekly average of the new and returning users and the duration
[13]: # Plot 1: Weekly Average of New vs. Returning Users
plt.figure(figsize=(10, 6))

# New users
plt.plot(weekly_averages['Week'], weekly_averages['New users'], marker='o',␣
↪linestyle='-', label='New users')

# Returning users
plt.plot(weekly_averages['Week'], weekly_averages['Returning users'],␣
↪marker='o', linestyle='-', label='Returning users')

# Add title and labels


plt.title('Weekly Average of New vs. Returning Users')
plt.xlabel('Week of the Year')
plt.ylabel('Average Number of Users')

# Add legend

7
plt.legend()

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()

# Plot 2: Weekly Average of Duration (Day 1 vs. Day 7)


plt.figure(figsize=(10, 6))

# Duration Day 1
plt.plot(weekly_averages['Week'], weekly_averages['Duration Day 1'],␣
↪marker='o', linestyle='-', label='Duration Day 1')

# Duration Day 7
plt.plot(weekly_averages['Week'], weekly_averages['Duration Day 7'],␣
↪marker='o', linestyle='-', label='Duration Day 7')

# Add title and labels


plt.title('Weekly Average of Duration (Day 1 vs. Day 7)')
plt.xlabel('Week of the Year')
plt.ylabel('Average Duration')

# Add legend
plt.legend()

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()

8
0.9 Calculating Metrics
Determining which metrics to analyze over time. Common metrics for cohort analysis include
retention rate, average duration, etc.

9
0.10 Cohort chart to understand the cohort matrix of weekly averages
In the cohort chart, each row will correspond to a week of the year, and each column will represent
a different metric.
• Average number of new users.
• Average number of returning users.
• Average duration on Day 1.
• Average duration on Day 7.
[14]: # Creating a cohort matrix
cohort_matrix = weekly_averages.set_index('Week')

# Plotting the cohort matrix


plt.figure(figsize=(12, 8))

sns.heatmap(cohort_matrix, annot=True, cmap='coolwarm', fmt=".1f")


plt.title('Cohort Matrix of Weekly Averages')
plt.ylabel('Week of the Year')
plt.show()

We can see that the number of new users and returning users fluctuates from week to week. Notably,
there was a significant increase in both new and returning users in Week 47.

10
The average duration of user engagement on Day 1 and Day 7 varies across the weeks. The durations
do not follow a consistent pattern about the number of new or returning users, suggesting that other
factors might be influencing user engagement.
Benefits of Cohort Analysis
• Cohort analysis helps group customers for targeted strategies.
• Reveals patterns in user behavior over time.
• Informs improvements based on cohort responses.
• Evaluates campaign effectiveness for optimized resource allocation.
• Guides strategies for improved customer loyalty and retent.
[ ]:

11

You might also like