import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import os
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
import seaborn as sns
color = sns.color_palette()
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from IPython.display import Image
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import os # accessing directory structure
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, make_scorer
from scipy.stats import skew
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, a
# Definitions
pd.set_option('display.float_format', lambda x: '%.3f' % x)
%matplotlib inline
#njobs = 4
pd.set_option('display.max_columns', 999)
def show_all(df):
    with pd.option_context('display.max_rows', 999, 'display.max_columns', 999):
        display(df)
Data Cleaning
In [5]: churn.head()
Out[5]: RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance Nu
Let's understand the columns present in our dataset and see which columns have
the least impact on a customer exiting the bank. We will remove those
columns.
1. RowNumber : This column corresponds to the record (row) number and has no effect on
the output. This column will be removed.
2. CustomerId : This column contains an arbitrary customer identifier and has no effect on a
customer leaving the bank. This column will be removed.
3. Surname : This column contains the surname of a customer, which has no impact on their
decision to leave the bank. This column will be removed.
4. CreditScore : This column contains the Credit Score of a Customer. This can have an effect
on customer churn, since a customer with a higher credit score is less likely to leave the
bank.
5. Geography : This column contains the location of a customer. A customer's location can
affect their decision to leave the bank. We'll keep this column.
6. Gender : This column contains the Gender information of each customer. It's interesting to
explore whether gender plays a role in a customer leaving the bank. We'll include this
column, too.
7. Age : This column contains the age of each customer. This is likely relevant; our initial
expectation is that older customers are less likely to leave their bank than younger ones.
8. Tenure : This column refers to the number of years that the customer has been a client of
the bank. Normally, longer-standing clients are more loyal and less likely to leave a bank.
9. Balance : This column contains the balance a customer has in their account. This may
also be a good indicator of customer churn; we expect people with a higher balance in their
accounts to be less likely to leave the bank than those with lower balances.
10. NumOfProducts : This column refers to the number of products that a customer has
purchased through the bank.
11. HasCrCard : This column denotes whether or not a customer has a credit card. This column
is also relevant, since people with a credit card are less likely to leave the bank.
12. IsActiveMember : This column indicates whether the customer is an active member. Active
customers are less likely to leave the bank, so we'll keep this.
13. EstimatedSalary : This column contains the estimated salaries for the customers. As with
balance, people with lower salaries are more likely to leave the bank compared to those
with higher salaries.
14. Exited : This column contains data for whether or not the customer left the bank. This is
what we have to predict.
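The column removal described in the list above can be sketched as follows. The notebook's own cell for this step is not shown in the export, so this is only an illustration: `sample` is a hypothetical two-row frame that reuses the dataset's column names, and `drop(columns=...)` is one common way to do it.

```python
import pandas as pd

# Hypothetical two-row sample reusing the dataset's column names.
sample = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [101, 102],
    "Surname": ["A", "B"],
    "CreditScore": [600, 700],
    "Exited": [1, 0],
})

# Identifier-like columns carry no predictive signal, so drop them.
id_cols = ["RowNumber", "CustomerId", "Surname"]
cleaned = sample.drop(columns=id_cols)

print(cleaned.columns.tolist())
```

Applied to the full frame, the same call would leave the 11 columns listed by `churn.info()`.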
In [7]: churn.head()
Out[7]: CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMe
In [8]: churn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CreditScore 10000 non-null int64
1 Geography 10000 non-null object
2 Gender 10000 non-null object
3 Age 10000 non-null int64
4 Tenure 10000 non-null int64
5 Balance 10000 non-null float64
6 NumOfProducts 10000 non-null int64
7 HasCrCard 10000 non-null int64
8 IsActiveMember 10000 non-null int64
9 EstimatedSalary 10000 non-null float64
10 Exited 10000 non-null int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
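`Geography` and `Gender` are reported as `object` above and as `category` in the next `churn.info()` output. The conversion cell itself is missing from the export; a minimal sketch of one way to do it, on a hypothetical `sample` frame, is:

```python
import pandas as pd

# Hypothetical sample with the two text columns and one numeric column.
sample = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 50],
})

# Convert the low-cardinality text columns to the category dtype.
for col in ["Geography", "Gender"]:
    sample[col] = sample[col].astype("category")

print(sample.dtypes)
```

Storing repeated strings as categories is also what typically reduces the reported memory usage.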
In [11]: churn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CreditScore 10000 non-null int64
1 Geography 10000 non-null category
2 Gender 10000 non-null category
3 Age 10000 non-null int64
4 Tenure 10000 non-null int64
5 Balance 10000 non-null float64
6 NumOfProducts 10000 non-null int64
7 HasCrCard 10000 non-null int64
8 IsActiveMember 10000 non-null int64
9 EstimatedSalary 10000 non-null float64
10 Exited 10000 non-null int64
dtypes: category(2), float64(2), int64(7)
memory usage: 723.0 KB
Out[12]: CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64
As we can see, there are no null values in any of the columns, so we can move ahead and
check for outliers.
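Per-column null counts like the ones above are typically produced with `isnull().sum()`. The cell that generated them is not shown, so the following is a sketch on a hypothetical frame that deliberately contains one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age value, for illustration only.
sample = pd.DataFrame({
    "Age": [42, np.nan, 39],
    "Balance": [0.0, 100.0, 250.5],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = sample.isnull().sum()
print(null_counts)
```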
Outlier Analysis
In [13]: num_cols = []
for cols in churn.select_dtypes("int64"):
    num_cols.append(cols)
In [15]: skew(churn['Age'])
Out[15]: 1.0111685586628079
Out[16]: <seaborn.axisgrid.FacetGrid at 0x1f5483d8280>
In [17]: sns.boxplot(x=churn['Age'])
Out[17]: <AxesSubplot:xlabel='Age'>
In [18]: churn['Age'].describe()
Out[18]: count 10000.000
mean 38.922
std 10.488
min 18.000
25% 32.000
50% 37.000
75% 44.000
max 92.000
Name: Age, dtype: float64
From the outlier analysis above, we can conclude that the data is suitable for further
analysis. The Age data is right-skewed (skewness of about 1.01), and the boxplot flags the
oldest customers as outliers. These are plausible ages (the maximum is 92) rather than data
errors, so they are not an issue.
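To make the boxplot reading concrete, the usual 1.5 × IQR rule can be worked out from the quartiles printed by `describe()` (25% = 32, 75% = 44). This is a sketch assuming seaborn's default whisker length of 1.5 × IQR; the variable names are illustrative.

```python
# Quartiles taken from the churn['Age'].describe() output above.
q1, q3 = 32.0, 44.0

iqr = q3 - q1                 # interquartile range
lower_bound = q1 - 1.5 * iqr  # points below this are drawn as outliers
upper_bound = q3 + 1.5 * iqr  # points above this are drawn as outliers

print(lower_bound, upper_bound)
```

Under this rule, ages above 62 are drawn as individual points. The minimum age of 18 is above the lower bound of 14, so no low outliers appear.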
Exploratory Analysis
Univariate Analysis
In [19]: churn.Exited.value_counts()
Out[19]: 0 7963
1 2037
Name: Exited, dtype: int64
Credit Score
In [20]: plt.figure(figsize=(10,6))
sns.histplot(churn['CreditScore'], kde=True)
plt.title('Credit Score')
plt.show()
From the plot above, we can conclude that the credit scores of customers approximately follow
a normal distribution, with no significant outliers or abnormalities.
Geography
In [21]: plt.figure(figsize=(10,6))
sns.countplot(x='Geography', data=churn)
plt.title('Geography')
plt.show()
From the chart above, we can see that the data covers three countries: France, Germany and
Spain. France has the largest number of customers, followed by Germany and
Spain.
Gender
In [22]: plt.figure(figsize=(10, 6))
sns.countplot(x='Gender', data=churn)
plt.title('Gender')
plt.show()
From the plot above, we can conclude that the dataset is fairly balanced between male
and female customers, with no significant gender imbalance.
Age
In [23]: plt.figure(figsize=(10, 6))
sns.histplot(churn['Age'])
plt.title('Age')
plt.show()
From the plot above, we can conclude that the distribution of customer ages is right-skewed
(consistent with the skewness of about 1.01 computed earlier), with a peak around the mid-30s
to mid-40s. No abnormal values are observed.
Tenure
In [24]: plt.figure(figsize=(10, 6))
sns.histplot(churn['Tenure'])
plt.title('Tenure')
plt.show()
From the plot above, we can conclude that the distribution of customer tenure indicates that
the majority of customers have been with the bank for a relatively short period. There are
no significant outliers or abnormalities.
Balance
In [25]: plt.figure(figsize=(10, 6))
sns.histplot(churn['Balance'], kde=True)
plt.title('Balance')
plt.show()
From the plot above, we can conclude that the distribution of account balance is
right-skewed, indicating that a significant portion of customers have relatively low balances.
However, a few customers have high account balances, which produces a long tail on the right
side of the distribution.
NumOfProducts
HasCrCard
IsActiveMember
EstimatedSalary
Exited
Now we will perform a bivariate analysis of all the variables and see what further information
we can get from it.
Bivariate Analysis
CreditScore vs Churn
Modelling
In [46]: # Selecting Independent and Dependent Variables
X = churn[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCa
y = churn['Exited']
In [49]: print(result.summary())
                           Logit Regression Results
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9992
Method:                           MLE   Df Model:                            7
Date:                Mon, 12 Jun 2023   Pseudo R-squ.:                  0.1286
Time:                        20:29:50   Log-Likelihood:                -4404.9
converged:                       True   LL-Null:                       -5054.9
Covariance Type:            nonrobust   LLR p-value:                1.744e-276
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -3.7211      0.234    -15.914      0.000      -4.179      -3.263
CreditScore       -0.0006      0.000     -2.265      0.024      -0.001   -8.41e-05
Age                0.0728      0.003     28.733      0.000       0.068       0.078
Tenure            -0.0161      0.009     -1.747      0.081      -0.034       0.002
Balance         4.959e-06   4.57e-07     10.858      0.000    4.06e-06    5.85e-06
NumOfProducts     -0.0220      0.046     -0.477      0.634      -0.112       0.068
HasCrCard         -0.0304      0.058     -0.521      0.602      -0.145       0.084
IsActiveMember    -1.0864      0.057    -19.081      0.000      -1.198      -0.975
==================================================================================
Conclusion
This logistic regression model suggests that Age, Balance, and IsActiveMember are the most
important factors in predicting customer churn (all with p < 0.001). CreditScore is also
significant at the 5% level (p = 0.024), although its coefficient is small. The remaining
variables, Tenure (p = 0.081), NumOfProducts (p = 0.634), and HasCrCard (p = 0.602), do not
show a statistically significant association with customer churn in this model.
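Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below does this for three coefficients copied from the summary table; the dictionary and variable names are illustrative.

```python
import math

# Coefficients copied from the Logit summary table above.
coefs = {
    "Age": 0.0728,
    "Balance": 4.959e-06,
    "IsActiveMember": -1.0864,
}

# exp(coef) is the multiplicative change in the odds of churn
# for a one-unit increase in the variable, other variables held fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Read this way, each additional year of age multiplies the odds of churn by roughly 1.08, while being an active member multiplies them by roughly 0.34. The positive Age and Balance coefficients indicate that, in this model, older customers and customers with higher balances are more likely to churn.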
Conclusion
The adjusted model after scaling, dropping insignificant variables, and considering the
appropriate metrics shows that CreditScore, Age, Balance, and IsActiveMember are important
factors in predicting customer churn. The model's overall performance and the significance of
the variables have improved compared to the previous model.
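For this dataset, choosing recall as the metric aligns with the business goal of identifying
potential churners and implementing effective retention strategies to reduce customer churn:
recall measures the share of actual churners that the model catches, and a missed churner is a
lost opportunity for retention.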
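The scaling, fitting, and recall steps mentioned above can be sketched as follows. The notebook's own cells for this model are not shown, so this is an assumed workflow rather than the author's exact code: the data is synthetic, and the `StandardScaler` plus `LogisticRegression` pipeline, the 80/20 split, and all variable names are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two predictors and a binary churn label.
rng = np.random.default_rng(42)
n = 1000
age = rng.integers(18, 80, size=n)
active = rng.integers(0, 2, size=n)
X_demo = np.column_stack([age, active]).astype(float)
prob = 1.0 / (1.0 + np.exp(-(-3.0 + 0.06 * age - 1.0 * active)))
y_demo = rng.binomial(1, prob)

# Hold out 20% of the rows, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scale the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Recall = TP / (TP + FN): the share of actual churners that are caught.
y_pred = model.predict(X_test)
recall = recall_score(y_test, y_pred)
print(f"recall = {recall:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, which avoids leaking test-set statistics into the model.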
Out[53]: LogisticRegression()
The precision curve shows that as the threshold decreases, precision tends to decrease. This
indicates that as we become more lenient in classifying instances as positive, the model is more
likely to produce false positives.
The recall curve demonstrates that as the threshold decreases, recall tends to increase. This
suggests that as we become more inclusive in classifying instances as positive, the model
captures a higher proportion of actual positive instances.
The plot allows you to identify an optimal threshold that balances precision and recall based on
the specific requirements of the business context. You can choose a threshold that achieves a
satisfactory trade-off between precision and recall, depending on the costs and consequences
associated with false positives and false negatives.
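One way to turn the trade-off described above into a concrete threshold is to fix a minimum acceptable recall and pick the threshold with the best precision among those that meet it. The sketch below uses scikit-learn's `precision_recall_curve` on a small hand-made example. The labels, scores, and the 0.75 recall target are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hand-made labels and predicted churn probabilities (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.75, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds;
# the final (precision=1, recall=0) point has no threshold, so drop it.
target_recall = 0.75  # assumed business requirement
meets_target = recall[:-1] >= target_recall

# Among thresholds that reach the target recall, take the most precise one.
best_idx = np.argmax(np.where(meets_target, precision[:-1], -1.0))
chosen_threshold = thresholds[best_idx]

print(chosen_threshold, precision[best_idx], recall[best_idx])
```

In this example a lower threshold only adds false positives without catching more churners, which is the pattern described above.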