Dataset:
You can use the Telco Customer Churn dataset from Kaggle. This dataset contains information
about telecom customers, including various features like contract type, monthly charges, and
whether the customer churned or not.
Objective:
The goal of this project is to predict customer churn (whether a customer will leave the telecom
service) using a model stacking approach. Model stacking involves training multiple models and
combining their predictions using another model.
https://www.linkedin.com/in/yennhi95zz/ || https://github.com/yennhi95zz || https://medium.com/@yennhi95zz
Steps:
1. Import Libraries: Import the necessary libraries and initialize Comet ML.
2. Load and Explore Data: Load the dataset and perform exploratory data analysis (EDA).
3. Preprocessing: Encode categorical features and scale numerical features.
4. Model Training: Train multiple machine learning models, including Logistic Regression, Random Forest, Gradient Boosting, and Support Vector Machine, then stack them.
This project will give you insights into dealing with classification problems, handling imbalanced
datasets (if applicable), and utilizing model stacking to enhance predictive performance.
0. Import Libraries
!pip install -q optuna comet_ml
import optuna
import comet_ml
from comet_ml import Experiment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from kaggle_secrets import UserSecretsClient  # Kaggle-provided client, used below to read the Comet API key
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
1. Initialize Comet ML
user_secrets = UserSecretsClient()
comet_api_key = user_secrets.get_secret("Comet API Key")
experiment = Experiment(
    api_key=comet_api_key,
project_name="customer-churn",
workspace="yennhi95zz"
)
2. Load Data
# Load the dataset
data = pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
data.head()
(output truncated: the first five rows of the DataFrame; the trailing Churn column reads No, No, Yes, No, Yes)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
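One thing the `info()` output reveals: `TotalCharges` is stored as `object`, because a handful of rows hold blank strings (customers with a tenure of 0). The notebook excerpt does not show this step, but a typical cleanup, illustrated on a tiny stand-in frame, is:

```python
import pandas as pd

# TotalCharges arrives as strings; blank entries correspond to tenure == 0
df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"], "tenure": [1, 0, 34]})

# Coerce to numeric: blank strings become NaN instead of raising an error
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Fill the missing values (here with 0, matching zero tenure)
df["TotalCharges"] = df["TotalCharges"].fillna(0)
print(df["TotalCharges"].tolist())  # → [29.85, 0.0, 1889.5]
```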
for p in ax.patches:
    ax.annotate(f'{int(round(p.get_height()))}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=12, color='black',
                xytext=(0, 5), textcoords='offset points')
plt.tight_layout()
plt.show()
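The annotation loop above refers to an `ax` created in a part of the cell lost in export. A self-contained sketch, using a small synthetic Churn column in place of the real data, might look like:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small synthetic frame standing in for the Telco data
df = pd.DataFrame({"Churn": ["No"] * 7 + ["Yes"] * 3})

fig, ax = plt.subplots(figsize=(4, 3))
sns.countplot(x="Churn", data=df, ax=ax)

# Annotate each bar with its height, as in the notebook cell
for p in ax.patches:
    ax.annotate(f"{int(round(p.get_height()))}",
                (p.get_x() + p.get_width() / 2.0, p.get_height()),
                ha="center", va="center", fontsize=12, color="black",
                xytext=(0, 5), textcoords="offset points")
plt.tight_layout()
```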
plt.tight_layout()
plt.show()
These plots provide insights into how different categories of customers (e.g., seniors vs. non-seniors, customers with partners vs. without) are distributed in terms of churn. You can identify potential customer segments that are more likely to churn.
experiment.log_figure(figure=plt)
plt.tight_layout()
plt.show()
plt.tight_layout()
plt.show()
It appears that customers who have higher Total Charges are less likely to churn. This suggests
that long-term customers who spend more are more loyal. You can use this insight to focus on
retaining high-value, long-term customers by offering loyalty programs or incentives. These
business insights derived from EDA can guide feature engineering and model selection for your
churn prediction project. They help you understand the data's characteristics and make informed
decisions to optimize customer retention strategies.
plt.tight_layout()
plt.show()
4. Preprocessing
# Encode categorical features, scale numerical features
X_train_encoded = encoder.fit_transform(X_train[categorical_features])
X_val_encoded = encoder.transform(X_val[categorical_features])
X_train_scaled = scaler.fit_transform(X_train[numerical_features])
X_val_scaled = scaler.transform(X_val[numerical_features])
Model Stacking
In this project, I stack models such as random forests, gradient boosting, and support vector machines, which have different characteristics and capture different aspects of the customer churn problem. This ensemble approach can yield a more accurate and robust churn prediction model, ultimately leading to better customer retention strategies and business outcomes.
    gb_params = {
        'n_estimators': trial.suggest_int('gb_n_estimators', 100, 300),
        'learning_rate': trial.suggest_float('gb_learning_rate', 0.01, 0.2),
        'max_depth': trial.suggest_categorical('gb_max_depth', [3, 4, 5]),
    }
    svm_params = {
        'C': trial.suggest_categorical('svm_C', [0.1, 1, 10]),
        'kernel': trial.suggest_categorical('svm_kernel', ['linear', 'rbf']),
    }
    rf = RandomForestClassifier(**rf_params)
    gb = GradientBoostingClassifier(**gb_params)
    svm = SVC(probability=True, **svm_params)

    rf_roc_auc = roc_auc_score(y_val, rf.predict_proba(X_val_processed)[:, 1])
    gb_roc_auc = roc_auc_score(y_val, gb.predict_proba(X_val_processed)[:, 1])
    svm_roc_auc = roc_auc_score(y_val, svm.predict_proba(X_val_processed)[:, 1])

    stacking_classifier = StackingClassifier(estimators=estimators,
                                             final_estimator=LogisticRegression())
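The excerpt above omits the fitting calls and the construction of `estimators`. Under the assumption that `estimators` pairs names with the tuned base models, the end-to-end flow looks roughly like this (toy data from `make_classification`, not the Telco set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy processed matrices standing in for the encoded/scaled Telco features
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train_processed, X_val_processed, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit each base model, then score it on the validation split
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train_processed, y_train)
gb = GradientBoostingClassifier(random_state=0).fit(X_train_processed, y_train)
svm = SVC(probability=True, random_state=0).fit(X_train_processed, y_train)
rf_roc_auc = roc_auc_score(y_val, rf.predict_proba(X_val_processed)[:, 1])

# The (name, estimator) pairs feed the stacking classifier, which refits
# clones of the base models internally during stack.fit
estimators = [("random_forest", rf), ("gradient_boosting", gb), ("svm", svm)]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train_processed, y_train)
stacking_roc_auc = roc_auc_score(y_val, stack.predict_proba(X_val_processed)[:, 1])
```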
experiment.log_parameters({
    'rf_min_samples_split': rf_params['min_samples_split'],
    'rf_min_samples_leaf': rf_params['min_samples_leaf'],
    'gb_n_estimators': gb_params['n_estimators'],
    'gb_learning_rate': gb_params['learning_rate'],
    'gb_max_depth': gb_params['max_depth'],
    'svm_C': svm_params['C'],
    'svm_kernel': svm_params['kernel']
})
experiment.log_metrics({
    'rf_accuracy': rf_accuracy,
    'gb_accuracy': gb_accuracy,
    'svm_accuracy': svm_accuracy,
    'rf_roc_auc': rf_roc_auc,
    'gb_roc_auc': gb_roc_auc,
    'svm_roc_auc': svm_roc_auc,
    'stacking_accuracy': stacking_accuracy,
    'stacking_roc_auc': stacking_roc_auc
})
Clarify the optimization goal: state whether you are minimizing or maximizing a specific metric. In the code, direction='minimize' tells Optuna to minimize the objective's return value, so the objective must return a loss or error metric (or a negated score, such as negative accuracy). If the objective returns accuracy or ROC AUC directly, use direction='maximize' instead.
print(f"Best RF Hyperparameters:\n{best_rf_params}")
print(f"Best Accuracy: {best_accuracy}")
Best RF Hyperparameters:
+----------------------+---------------------+
| Parameter | Value |
+======================+=====================+
| rf_n_estimators | 300 |
+----------------------+---------------------+
| rf_max_depth | 20 |
+----------------------+---------------------+
| rf_min_samples_split | 8 |
+----------------------+---------------------+
| rf_min_samples_leaf | 2 |
+----------------------+---------------------+
| gb_n_estimators | 139 |
+----------------------+---------------------+
| gb_learning_rate | 0.09345289942291049 |
+----------------------+---------------------+
| gb_max_depth | 4 |
+----------------------+---------------------+
| svm_C | 1 |
+----------------------+---------------------+
| svm_kernel | rbf |
+----------------------+---------------------+
Best Accuracy: 0.7917555081734187
experiment.end()
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.com/yennhi95zz/customer-churn/ce4189deb57943d281df0405dab75687
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     gb_accuracy [100]       : (0.7640369580668088, 0.7924662402274343)
COMET INFO:     gb_roc_auc [100]        : (0.8042899814154298, 0.8275504604728453)
COMET INFO:     rf_accuracy [100]       : (0.7604832977967306, 0.783226723525231)
COMET INFO:     rf_roc_auc [100]        : (0.7839880209762333, 0.8187654979267074)
COMET INFO:     stacking_accuracy [100] : (0.7782515991471215, 0.7917555081734187)
COMET INFO:     stacking_roc_auc [100]  : (0.8136417992348748, 0.8247718342815433)
COMET INFO:     svm_accuracy [100]      : (0.7732764747690121, 0.7860696517412935)
COMET INFO:     svm_roc_auc [100]       : (0.7664970414813818, 0.8130671788208375)
COMET INFO:   Parameters:
COMET INFO:     C                                  : 1.0
COMET INFO:     bootstrap                          : True
COMET INFO:     break_ties                         : False
COMET INFO:     cache_size                         : 200
COMET INFO:     categories                         : auto
COMET INFO:     ccp_alpha                          : 0.0
COMET INFO:     class_weight                       : 1
COMET INFO:     coef0                              : 0.0
COMET INFO:     constant                           : 1
COMET INFO:     copy                               : True
COMET INFO:     criterion                          : friedman_mse
COMET INFO:     cv                                 : 1
COMET INFO:     decision_function_shape            : ovr
COMET INFO:     degree                             : 3
COMET INFO:     drop                               : 1
COMET INFO:     dtype                              : <class 'numpy.float64'>
COMET INFO:     dual                               : False
COMET INFO:     estimators                         : [('random_forest', RandomForestClassifier(max_depth=20, min_samples_leaf=2, min_samples_split=9, n_estimators=295)), ('gradient_boosting', GradientBoostingClassifier(learning_rate=0.11487120000946097, max_depth=4, n_estimators=113)), ('svm', SVC(C=1, kernel='linear', probability=True))]
COMET INFO:     final_estimator                    : LogisticRegression()
COMET INFO:     final_estimator__C                 : 1.0
COMET INFO:     final_estimator__class_weight      : 1
COMET INFO:     final_estimator__dual              : False
COMET INFO:     final_estimator__fit_intercept     : True
COMET INFO:     final_estimator__intercept_scaling : 1
COMET INFO:     final_estimator__l1_ratio          : 1
References:
• GitHub Repository
• Kaggle Project
• Medium Article
👏 If you found this article interesting, your support in the comments with your own insights will help me spread the knowledge to others.
❗ Found the articles helpful? Get UNLIMITED access to every story on Medium with just $1/week HERE.