Churn Prediction Model

In [1]: import os # accessing directory structure
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
from IPython.display import Image, display
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import skew
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
color = sns.color_palette()
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, a
# Definitions
pd.set_option('display.float_format', lambda x: '%.3f' % x)
%matplotlib inline
#njobs = 4
In [2]: # changing pandas' option to display all columns
pd.set_option('display.max_columns', 999)
In [3]: # creating a function for ad hoc display of all rows
def show_all(df):
    with pd.option_context('display.max_rows', 999, 'display.max_columns', 999):
        display(df)
In [4]: # Load the dataset

churn = pd.read_csv(r'C:\Users\lenovo\Downloads\Churn_Modelling.csv')
print("churn : " + str(churn.shape))
churn : (10000, 14)
Data Cleaning
In [5]: churn.head()
Out[5]: RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance Nu
0 1 15634602 Hargrave 619 France Female 42 2 0.000
1 2 15647311 Hill 608 Spain Female 41 1 83807.860
2 3 15619304 Onio 502 France Female 42 8 159660.800
3 4 15701354 Boni 699 France Female 39 1 0.000
4 5 15737888 Mitchell 850 Spain Female 43 2 125510.820
Let's go through the columns in our dataset and identify those that have little bearing on
whether a customer leaves the bank. Those columns will be removed.
1. RowNumber : This column is the record (row) number and has no effect on the output. It
will be removed.
2. CustomerId : This column is an arbitrary identifier and has no effect on whether a customer
leaves the bank. It will be removed.
3. Surname : A customer's surname has no impact on their decision to leave the bank. It will
be removed.
4. CreditScore : The customer's credit score. This may affect churn; the working assumption is
that a customer with a higher credit score is less likely to leave the bank.
5. Geography : The customer's country. Location can affect the decision to leave the bank, so
we'll keep this column.
6. Gender : The gender of each customer. It's interesting to explore whether gender plays a
role in a customer leaving the bank, so we'll include this column, too.
7. Age : The customer's age. This is likely to be relevant; the initial assumption is that older
customers are less likely to leave their bank than younger ones, which the analysis below
puts to the test.
8. Tenure : The number of years the customer has been a client of the bank. Longer-standing
clients are normally expected to be more loyal and less likely to leave.
9. Balance : The amount the customer holds in their account. The initial assumption is that
people with higher balances are less likely to leave the bank than those with lower
balances; this is also checked in the analysis below.
10. NumOfProducts : The number of products that the customer has purchased through the
bank.
11. HasCrCard : Whether or not the customer has a credit card. We keep it on the assumption
that people with a credit card are less likely to leave the bank.
12. IsActiveMember : Whether or not the customer is an active member. Active customers are
expected to be less likely to leave the bank, so we'll keep this.
13. EstimatedSalary : The customer's estimated salary. As with balance, the assumption is that
people with lower salaries are more likely to leave the bank than those with higher
salaries.
14. Exited : Whether or not the customer left the bank. This is the target we have to
predict.
In [6]: # Dropping the irrelevant columns
churn = churn.drop(['RowNumber', 'CustomerId','Surname'], axis = 1)
In [7]: churn.head()
Out[7]: CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMe
0 619 France Female 42 2 0.000 1 1
1 608 Spain Female 41 1 83807.860 1 0
2 502 France Female 42 8 159660.800 3 1
3 699 France Female 39 1 0.000 2 0
4 850 Spain Female 43 2 125510.820 1 1
In [8]: churn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CreditScore 10000 non-null int64
1 Geography 10000 non-null object
2 Gender 10000 non-null object
3 Age 10000 non-null int64
4 Tenure 10000 non-null int64
5 Balance 10000 non-null float64
6 NumOfProducts 10000 non-null int64
7 HasCrCard 10000 non-null int64
8 IsActiveMember 10000 non-null int64
9 EstimatedSalary 10000 non-null float64
10 Exited 10000 non-null int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
In [9]: # We will convert Object Type to Category data type

cat_cols = []
for cols in churn.select_dtypes("object"):
    cat_cols.append(cols)
In [10]: churn[cat_cols] = churn[cat_cols].astype("category")
In [11]: churn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CreditScore 10000 non-null int64
1 Geography 10000 non-null category
2 Gender 10000 non-null category
3 Age 10000 non-null int64
4 Tenure 10000 non-null int64
5 Balance 10000 non-null float64
6 NumOfProducts 10000 non-null int64
7 HasCrCard 10000 non-null int64
8 IsActiveMember 10000 non-null int64
9 EstimatedSalary 10000 non-null float64
10 Exited 10000 non-null int64
dtypes: category(2), float64(2), int64(7)
memory usage: 723.0 KB
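The two info() outputs above show the reported memory usage dropping from 859.5+ KB to 723.0 KB
after the conversion (the "+" marks the first figure as a lower bound, since object columns are
not measured deeply). A minimal sketch of the same conversion on a hypothetical toy frame, not
the real churn file, with the deep memory usage measured before and after:

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data (not the real file).
toy = pd.DataFrame({
    "Geography": pd.Series(["France", "Germany", "Spain"] * 1000, dtype=object),
    "Gender": pd.Series(["Female", "Male", "Male"] * 1000, dtype=object),
    "Age": [42, 41, 39] * 1000,
})

# deep=True counts the Python string objects, not just the pointers to them.
before = toy.memory_usage(deep=True).sum()

# Same conversion as In [9]-[10], written without the explicit loop.
cat_cols = toy.select_dtypes("object").columns.tolist()
toy[cat_cols] = toy[cat_cols].astype("category")

after = toy.memory_usage(deep=True).sum()
print(cat_cols, before, after)
```

A category column stores each distinct label once plus a small integer code per row, which is
why repeated strings such as country names shrink so much.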
In [12]: # Checking for missing values

churn.isnull().sum()
Out[12]: CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64
There are no NULL values in any of the columns, so we can move ahead and check for outliers.
Outlier Analysis
In [13]: num_cols = []
for cols in churn.select_dtypes("int64"):
    num_cols.append(cols)
In [14]: sns.set(font_scale = 1.5)

fig = plt.figure(figsize=(24,60))
i = 1
for column in churn[num_cols]:
    plt.subplot(13,2,i)
    sns.boxplot(x = churn['Exited'], y = churn.loc[:,column])
    i = i+1
plt.tight_layout()
plt.show()
In [15]: skew(churn['Age'])
Out[15]: 1.0111685586628079
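With its default settings, scipy.stats.skew returns the biased sample skewness, the third
central moment divided by the second central moment raised to the power 1.5. A small sketch of
that formula with NumPy, on made-up numbers rather than the Age column (the function name
sample_skewness is ours, for illustration only):

```python
import numpy as np

def sample_skewness(values):
    """Biased sample skewness g1 = m3 / m2**1.5, using central moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (population variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

# Made-up values with one large observation pulling the right tail out.
toy_values = [1, 2, 3, 4, 10]
toy_skew = sample_skewness(toy_values)
print(round(toy_skew, 4))
```

A positive result, such as the value of about 1.01 reported for Age, means the right tail is
longer than the left.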
In [16]: sns.displot(churn['Age'], kde = True)
Out[16]: <seaborn.axisgrid.FacetGrid at 0x1f5483d8280>
In [17]: sns.boxplot(x=churn['Age'])
Out[17]: <AxesSubplot:xlabel='Age'>
In [18]: churn['Age'].describe()
Out[18]: count 10000.000
mean 38.922
std 10.488
min 18.000
25% 32.000
50% 37.000
75% 44.000
max 92.000
Name: Age, dtype: float64
The Age distribution is right-skewed (skewness of about 1.01), and the boxplots show that
customers who churned tend to be older. The high ages are plausible values rather than data
errors, so this is not an issue: we keep all rows and continue with the analysis.
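The points a boxplot draws beyond its whiskers follow the usual 1.5 x IQR rule (the seaborn
default). As a rough check, the fences for Age can be worked out from the quartiles printed by
describe() above:

```python
# Quartiles of Age, copied from the describe() output above.
q1, q3 = 32.0, 44.0
age_min, age_max = 18.0, 92.0

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr    # values above this are drawn as outliers

has_low_outliers = age_min < lower_fence
has_high_outliers = age_max > upper_fence
print(lower_fence, upper_fence, has_low_outliers, has_high_outliers)
```

So only the upper tail is flagged: customers older than the upper fence appear as individual
points, while the youngest customer (18) is still inside the lower fence.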
Exploratory Analysis
Univariate Analysis
In [19]: churn.Exited.value_counts()
Out[19]: 0 7963
1 2037
Name: Exited, dtype: int64
Credit Score
In [20]: plt.figure(figsize=(10,6))
sns.histplot(churn['CreditScore'], kde=True)
plt.title('Credit Score')
plt.show()
From the plot above, the credit scores of customers are approximately normally distributed,
with no significant outliers or abnormalities.
Geography
In [21]: plt.figure(figsize=(10,6))
sns.countplot(x='Geography', data=churn)
plt.title('Geography')
plt.show()
The chart shows that the data covers three countries: France, Germany, and Spain. France has
the most customers (5,014), followed by Germany (2,509) and Spain (2,477).
Gender
In [22]: plt.figure(figsize=(10, 6))
sns.countplot(x='Gender', data=churn)
plt.title('Gender')
plt.show()
From the plot above, the dataset is fairly balanced between male (5,457) and female (4,543)
customers, so there is no severe gender imbalance.
Age
In [23]: plt.figure(figsize=(10, 6))
sns.histplot(churn['Age'])
plt.title('Age')
plt.show()
From the plot above, the distribution of customer ages is right-skewed, with a peak around the
mid-30s to mid-40s (the median age is 37) and a long tail of older customers up to age 92.
Tenure
In [24]: plt.figure(figsize=(10, 6))
sns.histplot(churn['Tenure'])
plt.title('Tenure')
plt.show()
From the plot above, the distribution of customer tenure indicates that the majority of
customers have been with the bank for a relatively short period. There are no significant
outliers or abnormalities.
Balance
In [25]: plt.figure(figsize=(10, 6))
sns.histplot(churn['Balance'], kde=True)
plt.title('Balance')
plt.show()
From the plot above, the distribution of account balance is skewed to the right: a significant
portion of customers have relatively low balances, while a few customers with high balances
produce a long tail on the right side of the distribution.
NumOfProducts
In [26]: plt.figure(figsize=(10, 6))
sns.countplot(x='NumOfProducts', data=churn)
plt.title('Number of Products')
plt.show()
Analyzing the bar chart, we can conclude that the majority of customers hold only one or two
bank products. The number of customers decreases as the number of products increases,
indicating that fewer customers hold three or four products.
HasCrCard
In [27]: plt.figure(figsize=(10, 6))
sns.countplot(x='HasCrCard', data=churn)
plt.title('Has Credit Card')
plt.show()
The bar chart shows that a clear majority of customers hold a credit card: 7,055 customers
(about 71%) have one, against 2,945 (about 29%) who do not.
IsActiveMember
In [28]: plt.figure(figsize=(10, 6))
sns.countplot(x='IsActiveMember', data=churn)
plt.title('Active Member')
plt.show()
The dataset has a fairly balanced distribution between active and inactive members. There is no
significant imbalance between the two groups.
EstimatedSalary
In [29]: plt.figure(figsize=(10, 6))
sns.histplot(churn['EstimatedSalary'], kde = True)
plt.title('Estimated Salary')
plt.show()
The distribution of estimated salaries appears to be relatively uniform, with no significant
outliers or abnormalities observed.
Exited
In [30]: plt.figure(figsize=(10, 6))
sns.countplot(x='Exited', data=churn)
plt.title('Churn Status')
plt.show()
The dataset is imbalanced in terms of churn status: 7,963 customers stayed and 2,037 (about
20%) churned, so churn is relatively infrequent in the dataset.
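This imbalance matters when a model is evaluated later. As a quick illustration using the
counts from value_counts() above, a trivial "model" that predicts that nobody churns already
reaches a high accuracy while catching no churners at all:

```python
# Class counts copied from churn.Exited.value_counts() above.
stayed, exited = 7963, 2037
total = stayed + exited

churn_rate = exited / total

# Baseline that always predicts "stays" (Exited = 0).
baseline_accuracy = stayed / total   # every non-churner is classified correctly
baseline_recall = 0 / exited         # no churner is ever flagged

print(round(churn_rate, 4), round(baseline_accuracy, 4), baseline_recall)
```

Accuracy alone is therefore a weak yardstick here; metrics that focus on the churned class,
such as recall, say more about whether the model is useful.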
Now we will perform a bivariate analysis of each variable against churn to see what further
information we can extract.
Bivariate Analysis
CreditScore vs. Churn
In [31]: plt.figure(figsize=(10, 6))
sns.boxplot(x='Exited', y='CreditScore', data=churn)
plt.title('Credit Score vs. Churn')
plt.show()
The boxplot shows no visible difference in credit scores between churned and non-churned
customers. Both groups have similar credit score distributions.
Geography vs. Churn
In [32]: plt.figure(figsize=(10, 6))
sns.countplot(x='Geography', hue='Exited', data=churn)
plt.title('Geography vs. Churn')
plt.show()
The countplot indicates that Germany has by far the highest churn rate of the three countries:
about 32% of German customers churned (814 of 2,509), roughly twice the rate in France (about
16%, 810 of 5,014) and Spain (about 17%, 413 of 2,477).
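A countplot shows absolute counts, so churn rates are easier to read from a row-normalised
table. A minimal sketch using the Geography counts that appear in the chi-square output later
in this notebook (the column names stayed and exited are ours, chosen for readability):

```python
import pandas as pd

# Counts copied from the Geography vs. Exited contingency table in the chi-square section.
counts = pd.DataFrame(
    {"stayed": [4204, 1695, 2064], "exited": [810, 814, 413]},
    index=["France", "Germany", "Spain"],
)

# Churn rate per country = churned customers / all customers in that country.
churn_rate = counts["exited"] / counts.sum(axis=1)
print(churn_rate.round(3))
```

On the full DataFrame, pd.crosstab(churn['Geography'], churn['Exited'], normalize='index')
should give the same rates directly.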
Gender vs. Churn
In [33]: plt.figure(figsize=(10, 6))
sns.countplot(x='Gender', hue='Exited', data=churn)
plt.title('Gender vs. Churn')
plt.show()
The countplot shows a higher churn rate for female customers (about 25%, 1,139 of 4,543) than
for male customers (about 16%, 898 of 5,457). The chi-square test later in the notebook
confirms that this difference is statistically significant.
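Age vs. Churn
In [34]: plt.figure(figsize=(10, 6))
sns.boxplot(x='Exited', y='Age', data=churn)
plt.title('Age vs. Churn')
plt.show()
The boxplot shows that churned customers tend to be older than customers who stayed, and the
difference is substantial: in the binned counts reported in the chi-square section, the churn
rate rises from about 8% in the youngest age bin to about 46% in the oldest.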
Tenure vs. Churn
In [35]: plt.figure(figsize=(10, 6))
sns.boxplot(x='Exited', y='Tenure', data=churn)
plt.title('Tenure vs. Churn')
plt.show()
The boxplot indicates that there is no significant difference in tenure between churned and non-
churned customers. Both groups have similar tenure distributions.
Balance vs. Churn
In [36]: plt.figure(figsize=(10, 6))
sns.boxplot(x='Exited', y='Balance', data=churn)
plt.title('Balance vs. Churn')
plt.show()
The boxplot shows that churned customers tend to hold higher balances than customers who
stayed. In the binned counts reported in the chi-square section, the lowest-balance bin has a
churn rate of about 15%, against 22% to 27% in the higher-balance bins.
NumOfProducts vs. Churn
In [37]: plt.figure(figsize=(10, 6))
sns.countplot(x='NumOfProducts', hue='Exited', data=churn)
plt.title('Number of Products vs. Churn')
plt.show()
The countplot suggests that customers with two products have the lowest churn rate (about 8%),
followed by customers with one product (about 28%). Churn is very high among the few customers
with three products (about 83%, 220 of 266), and all 60 customers with four products churned.
HasCrCard vs. Churn
In [38]: plt.figure(figsize=(10, 6))
sns.countplot(x='HasCrCard', hue='Exited', data=churn)
plt.title('Has Credit Card vs. Churn')
plt.show()
The countplot indicates that the presence or absence of a credit card does not have a significant
impact on the churn rate. The churn rates are similar for customers with or without a credit card.
IsActiveMember vs. Churn
In [39]: plt.figure(figsize=(10, 6))
sns.countplot(x='IsActiveMember', hue='Exited', data=churn)
plt.title('Active Member vs. Churn')
plt.show()
The countplot suggests that non-active members have a higher churn rate compared to active
members. Being an active member seems to be associated with a lower likelihood of churn.
EstimatedSalary vs. Churn
In [40]: plt.figure(figsize=(10, 6))
sns.boxplot(x='Exited', y='EstimatedSalary', data=churn)
plt.title('Estimated Salary vs. Churn')
plt.show()
The boxplot shows that there is no significant difference in estimated salaries between churned
and non-churned customers. Both groups have similar estimated salary distributions.
Chi Square test for all Categorical Variables

In [41]: # List of categorical variables
categorical_vars = ['Geography', 'Gender', 'NumOfProducts', 'HasCrCard', 'IsAct
In [42]: # Perform chi-square test for each categorical variable

for var in categorical_vars:
    contingency_table = pd.crosstab(churn[var], churn['Exited'])
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
    # Print the results
    print(f'Chi-square test for {var} vs. Exited')
    print('-----------------------')
    print('Contingency Table:')
    print(contingency_table)
    print('-----------------------')
    print(f'Chi-square statistic: {chi2}')
    print(f'p-value: {p_value}')
    print('-----------------------\n')
Chi-square test for Geography vs. Exited
-----------------------
Contingency Table:
Exited 0 1
Geography
France 4204 810
Germany 1695 814
Spain 2064 413
-----------------------
Chi-square statistic: 301.25533682434536
p-value: 3.8303176053541544e-66
-----------------------
Chi-square test for Gender vs. Exited

-----------------------
Contingency Table:
Exited 0 1
Gender
Female 3404 1139
Male 4559 898
-----------------------
p-value: 2.2482100097131755e-26
-----------------------
Chi-square test for NumOfProducts vs. Exited

-----------------------
Contingency Table:
Exited 0 1
NumOfProducts
1 3675 1409
2 4242 348
3 46 220
4 0 60
-----------------------
p-value: 0.0
-----------------------
Chi-square test for HasCrCard vs. Exited

-----------------------
Contingency Table:
Exited 0 1
HasCrCard
0 2332 613
1 5631 1424
-----------------------
p-value: 0.49237236141554686
-----------------------
Chi-square test for IsActiveMember vs. Exited

-----------------------
Contingency Table:
Exited 0 1
IsActiveMember
0 3547 1302
1 4416 735
-----------------------
p-value: 8.785858269303703e-55
-----------------------
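To make the test less of a black box, the Geography result above can be reproduced by hand:
expected counts come from the row and column totals, and the statistic sums the squared
deviations divided by the expected counts. A sketch under the assumption that no continuity
correction applies (scipy's chi2_contingency only applies Yates' correction by default when
there is one degree of freedom, i.e. for 2x2 tables):

```python
import math
import numpy as np

# Geography vs. Exited counts, copied from the contingency table above.
observed = np.array([[4204, 810],    # France
                     [1695, 814],    # Germany
                     [2064, 413]],   # Spain
                    dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Expected count under independence: row total * column total / grand total.
expected = row_totals @ col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# For exactly 2 degrees of freedom the chi-square tail probability is exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(chi2, dof, p_value)
```

The hand calculation agrees with the statistic and p-value printed for Geography above.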
In [43]: # List of Numerical Columns

numeric_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary'
In [44]: # Defining the Bin function
def bin_numeric_column(column, num_bins):
    _, bins = pd.qcut(column, num_bins, retbins=True, duplicates='drop')
    return pd.cut(column, bins=bins, labels=False, include_lowest=True)

num_bins = 5 # Number of bins for binning

for col in numeric_columns:
    churn[col + '_binned'] = bin_numeric_column(churn[col], num_bins)
In [45]: # Performing the Chi Square test
binned_columns = [col + '_binned' for col in numeric_columns]

for col in binned_columns:
    contingency_table = pd.crosstab(churn[col], churn['Exited'])
    chi2, p_value, _, _ = stats.chi2_contingency(contingency_table)
    # Print the results
    print(f'Chi-square test for {col} vs. Exited')
    print('-----------------------')
    print('Contingency Table:')
    print(contingency_table)
    print('-----------------------')
    print(f'Chi-square statistic: {chi2}')
    print(f'p-value: {p_value}')
    print('-----------------------\n')
Chi-square test for CreditScore_binned vs. Exited
-----------------------
Contingency Table:
Exited 0 1
CreditScore_binned
0 1558 452
1 1599 421
2 1615 395
3 1618 363
4 1573 406
-----------------------
p-value: 0.020494843737859876
-----------------------
Chi-square test for Age_binned vs. Exited

-----------------------
Contingency Table:
Exited 0 1
Age_binned
0 2191 181
1 1615 166
2 1927 339
3 1211 485
4 1019 866
-----------------------
p-value: 7.86362901772543e-268
-----------------------
Chi-square test for Tenure_binned vs. Exited

-----------------------
Contingency Table:
Exited 0 1
Tenure_binned
0 1968 528
1 1582 416
2 1574 405
3 1679 374
4 1160 314
-----------------------
p-value: 0.09674012551068842
-----------------------
Chi-square test for Balance_binned vs. Exited

-----------------------
Contingency Table:
Exited 0 1
Balance_binned
0 3410 590
1 1554 446
2 1461 539
3 1538 462
-----------------------
p-value: 3.0739103340650463e-31
-----------------------
Chi-square test for EstimatedSalary_binned vs. Exited

-----------------------
Contingency Table:
Exited 0 1
EstimatedSalary_binned
0 1601 399
1 1601 399
2 1596 404
3 1596 404
4 1569 431
-----------------------
p-value: 0.6948034012000351
-----------------------
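Note that Balance_binned has only four bins although num_bins is 5, and its first bin holds
4,000 customers. This is what duplicates='drop' does when several quantile edges coincide,
which happens when many values are identical. A small sketch with made-up balances (40% of
them zero), reusing the binning function from In [44]:

```python
import pandas as pd

def bin_numeric_column(column, num_bins):
    # Quantile edges; identical edges are merged instead of raising an error.
    _, bins = pd.qcut(column, num_bins, retbins=True, duplicates='drop')
    return pd.cut(column, bins=bins, labels=False, include_lowest=True)

# Made-up balances: four of the ten values are exactly zero.
toy_balance = pd.Series([0, 0, 0, 0, 10, 20, 30, 40, 50, 60], dtype=float)

binned = bin_numeric_column(toy_balance, 5)
print(binned.value_counts().sort_index())
```

Because the 0% and 20% quantiles are both zero, the two lowest edges collapse into one and all
zero balances end up together in a single, larger first bin.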
Modelling
In [46]: # Selecting Independent and Dependent Variables
X = churn[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCa
y = churn['Exited']
In [47]: # Adding a constant term to the independent variables

X = sm.add_constant(X)
In [48]: # Fitting the logistic regression model using statsmodels

logit_model = sm.Logit(y, X)
result = logit_model.fit()
Optimization terminated successfully.

Current function value: 0.440494
Iterations 6
In [49]: print(result.summary())
                           Logit Regression Results
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9992
Method:                           MLE   Df Model:                            7
Date:                Mon, 12 Jun 2023   Pseudo R-squ.:                  0.1286
Time:                        20:29:50   Log-Likelihood:                -4404.9
converged:                       True   LL-Null:                       -5054.9
Covariance Type:            nonrobust   LLR p-value:                1.744e-276
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -3.7211      0.234    -15.914      0.000      -4.179      -3.263
CreditScore       -0.0006      0.000     -2.265      0.024      -0.001   -8.41e-05
Age                0.0728      0.003     28.733      0.000       0.068       0.078
Tenure            -0.0161      0.009     -1.747      0.081      -0.034       0.002
Balance         4.959e-06   4.57e-07     10.858      0.000    4.06e-06    5.85e-06
NumOfProducts     -0.0220      0.046     -0.477      0.634      -0.112       0.068
HasCrCard         -0.0304      0.058     -0.521      0.602      -0.145       0.084
IsActiveMember    -1.0864      0.057    -19.081      0.000      -1.198      -0.975
==================================================================================
Conclusion
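This logistic regression model suggests that Age, Balance, and IsActiveMember are the
strongest predictors of customer churn (all with p < 0.001): churn becomes more likely with
higher age and higher balance, and less likely for active members. CreditScore is also
significant at the 5% level (p = 0.024), although its effect is small. Tenure (p = 0.081),
NumOfProducts (p = 0.634), and HasCrCard (p = 0.602) do not show a statistically significant
association with customer churn in this model.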
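Logit coefficients are on the log-odds scale, so they are easier to interpret once
exponentiated into odds ratios. A short sketch using the coefficients from the summary above
(the 10-year and 10,000-unit steps are arbitrary choices for illustration):

```python
import math

# Coefficients copied from the Logit summary above.
coefs = {"Age": 0.0728, "Balance": 4.959e-06, "IsActiveMember": -1.0864}

# exp(coef) = multiplicative change in the odds of churn per one-unit increase.
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

# Effects over larger, more readable steps.
age_10_years = math.exp(10 * coefs["Age"])
balance_10k = math.exp(10_000 * coefs["Balance"])

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per unit = {ratio:.4f}")
print(f"Age, +10 years: {age_10_years:.3f}")
print(f"Balance, +10,000: {balance_10k:.3f}")
```

Read this way, ten additional years of age roughly double the odds of churn, while being an
active member cuts the odds to about a third, holding the other variables fixed.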
In [50]: # Selecting the relevant features

X = churn[['CreditScore', 'Age', 'Balance', 'IsActiveMember']]
# Scaling the numeric features

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Adding a constant term to the independent variables

X_scaled = sm.add_constant(X_scaled)
# Fitting the logistic regression model

logit_model = sm.Logit(y, X_scaled)
result = logit_model.fit()
# Print the summary of the model

print(result.summary())
Optimization terminated successfully.
Current function value: 0.440675
Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9995
Method:                           MLE   Df Model:                            4
Date:                Mon, 12 Jun 2023   Pseudo R-squ.:                  0.1282
Time:                        20:29:50   Log-Likelihood:                -4406.7
converged:                       True   LL-Null:                       -5054.9
Covariance Type:            nonrobust   LLR p-value:                2.121e-279
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.6075      0.030    -53.581      0.000      -1.666      -1.549
x1            -0.0604      0.027     -2.268      0.023      -0.113      -0.008
x2             0.7641      0.027     28.756      0.000       0.712       0.816
x3             0.3130      0.028     11.376      0.000       0.259       0.367
x4            -0.5413      0.028    -19.043      0.000      -0.597      -0.486
==============================================================================
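Because StandardScaler returns a plain NumPy array, the summary labels the predictors x1 to x4
in column order (CreditScore, Age, Balance, IsActiveMember). A standardised coefficient is
roughly the raw coefficient multiplied by the feature's standard deviation, which gives a
quick sanity check. The sketch below uses the raw coefficients from the first model, so the
match is only approximate:

```python
import math

# Age: raw coefficient from the first model, standard deviation from describe().
raw_age_coef = 0.0728
age_std = 10.488
approx_x2 = raw_age_coef * age_std
reported_x2 = 0.7641

# IsActiveMember is binary: 5,151 of 10,000 customers are active (contingency table),
# so its standard deviation is sqrt(p * (1 - p)).
p_active = 5151 / 10000
active_std = math.sqrt(p_active * (1 - p_active))
approx_x4 = -1.0864 * active_std
reported_x4 = -0.5413

print(round(approx_x2, 4), reported_x2)
print(round(approx_x4, 4), reported_x4)
```

Both products land close to the reported x2 and x4, which confirms the column mapping and
shows that scaling changes the units of the coefficients, not the underlying relationships.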
Conclusion
After dropping the insignificant variables and standardising the features, CreditScore (x1),
Age (x2), Balance (x3), and IsActiveMember (x4) all remain significant predictors of customer
churn. The fit is essentially unchanged from the previous model (pseudo R-squared of 0.1282
versus 0.1286, log-likelihood of -4406.7 versus -4404.9), so the smaller model explains churn
about as well with three fewer variables.
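Since the second model is nested in the first (same observations, three predictors removed,
and scaling does not change the likelihood), the two fits can be compared with a
likelihood-ratio test. A minimal sketch using the log-likelihoods printed in the two
summaries; 7.815 is the standard 5% critical value of the chi-square distribution with 3
degrees of freedom:

```python
# Log-likelihoods copied from the two Logit summaries.
ll_full = -4404.9      # 7 predictors
ll_reduced = -4406.7   # 4 predictors

# Likelihood-ratio statistic: twice the loss in log-likelihood from dropping variables.
lr_stat = 2 * (ll_full - ll_reduced)
df_diff = 7 - 4        # number of predictors dropped

critical_5pct = 7.815  # chi-square critical value, 3 degrees of freedom, alpha = 0.05
reduced_is_worse = lr_stat > critical_5pct

print(round(lr_stat, 2), df_diff, reduced_is_worse)
```

The statistic is well below the critical value, so dropping Tenure, NumOfProducts, and
HasCrCard does not significantly worsen the fit.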
For this dataset, recall is a suitable evaluation metric: it matches the business goal of
identifying as many potential churners as possible so that retention strategies can be
applied to them.
In [51]: # Selecting the relevant features

X = churn[['CreditScore', 'Age', 'Balance', 'IsActiveMember']]
In [52]: # Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random
In [53]: # Train the logistic regression model

model = LogisticRegression()
model.fit(X_train, y_train)
Out[53]: LogisticRegression()
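The model above is fitted on the raw features, whose scales differ by several orders of
magnitude (Balance runs into the hundreds of thousands, IsActiveMember is 0 or 1).
LogisticRegression applies L2 regularisation by default, which is sensitive to feature scale,
so one possible alternative is to put the scaler and the model in a single pipeline. The
sketch below uses synthetic stand-in data, not the churn file, and all variable names in it
are ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for an age-like and a balance-like column (very different scales).
age = rng.normal(39, 10, n)
balance = rng.uniform(0, 200_000, n)
X_toy = np.column_stack([age, balance])

# Synthetic target drawn from a logistic model, so both classes are present.
logits = 0.07 * (age - 39) + 5e-6 * (balance - 100_000) - 1.4
y_toy = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# The scaler is fitted on the training split only, then reused on the test split.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
probs = pipe.predict_proba(X_te)[:, 1]
print(probs[:5])
```

Keeping the scaler inside the pipeline also avoids leaking test-set statistics into training,
which can happen when the whole dataset is scaled before the split.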
In [54]: # Obtain predicted probabilities for the positive class

y_probs = model.predict_proba(X_test)[:, 1]
In [55]: # Calculate ROC-AUC score

roc_auc = roc_auc_score(y_test, y_probs)
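ROC-AUC has a direct interpretation: it is the probability that a randomly chosen churner
receives a higher predicted probability than a randomly chosen non-churner (ties count as
half). A small pure-Python sketch of that definition on made-up labels and scores (the
function name pairwise_auc is ours):

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Made-up example: 4 churners and 4 non-churners.
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, independent of any
classification threshold.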
In [56]: # Calculate precision, recall, and threshold values

precision, recall, thresholds = precision_recall_curve(y_test, y_probs)
In [57]: # Plot ROC curve

fpr, tpr, _ = roc_curve(y_test, y_probs)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()
In [58]: # Plot precision, recall, and threshold values

plt.figure(figsize=(8, 6))
plt.plot(thresholds, precision[:-1], "b-", label="Precision")
plt.plot(thresholds, recall[:-1], "r-", label="Recall")
plt.xlabel("Threshold")
plt.ylabel("Value")
plt.title("Precision, Recall, and Threshold Values")
plt.legend(loc="best")
plt.grid(True)
plt.show()
Conclusion:
The precision curve shows that as the threshold decreases, precision tends to decrease. This
indicates that as we become more lenient in classifying instances as positive, the model is more
likely to produce false positives.
The recall curve demonstrates that as the threshold decreases, recall tends to increase. This
suggests that as we become more inclusive in classifying instances as positive, the model
captures a higher proportion of actual positive instances.
The plot allows you to identify an optimal threshold that balances precision and recall based on
the specific requirements of the business context. You can choose a threshold that achieves a
satisfactory trade-off between precision and recall, depending on the costs and consequences
associated with false positives and false negatives.
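One way to turn this trade-off into a concrete choice is to fix a minimum recall and take the
highest threshold that still reaches it, which keeps precision as high as possible under that
constraint. A pure-Python sketch on made-up labels and probabilities; the function names and
the 0.75 recall target are illustrative assumptions, not part of the analysis above:

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when predicting churn for probs >= threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def highest_threshold_with_recall(y_true, probs, target_recall):
    """Scan thresholds from high to low; return the first one meeting the target."""
    for threshold in sorted(set(probs), reverse=True):
        precision, recall = precision_recall_at(y_true, probs, threshold)
        if recall >= target_recall:
            return threshold, precision, recall
    return None

# Made-up example: 4 churners and 4 non-churners.
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_probs = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
chosen = highest_threshold_with_recall(toy_labels, toy_probs, target_recall=0.75)
print(chosen)
```

With the real model, the same search could be run over the arrays returned by
precision_recall_curve instead of recomputing the counts for each threshold.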