
TEAM: OMICRON
Group Assignment

TEAM MEMBERS DETAILS:

Name: Ankita Mohapatra (22020343011)
Name: Ajitesh Das (22020343006)
Name: Varun Kumar Sinha (22020343076)
Name: Sundar Bhattacharjee (22020343070)
Name: Anshul Aggarwal (22020343013)

Defining the Problem

A consumer credit card bank is facing the problem of customer attrition. The
bank wants to analyse its data to find the reasons behind the attrition and
use them to predict which customers are likely to drop off.

Exploring the Data set

1. CLIENTNUM: The client number, a unique identifier for the customer holding
the account. Data type is numerical; it is of no use in our analysis.

2. Attrition_Flag: The customer activity variable. There are two types of
customers in our data set: Attrited customers and Existing customers. Data type
is categorical.

3. Customer_Age: A demographic variable giving the customer's age. Data type is
numerical.

4. Gender: Another demographic variable. M denotes male and F denotes female.
Data type is categorical.

5. Dependent_count: The number of dependents in the customer's family.

6. Education_Level: The educational qualification of the account holder. Data
type is categorical, with categories College, Doctorate, Graduate, High School,
Post-Graduate, Uneducated and Unknown.

7. Marital_Status: The marital status of the customer. Data type is
categorical, with categories Divorced, Married, Single and Unknown.

8. Income_Category: The annual income of the customer, grouped into slots. Data
type is categorical, with slots < $40K, $40K - $60K, $60K - $80K, $80K - $120K
and > $120K.

9. Card_Category: The type of credit card the customer uses. Data type is
categorical, with card types Blue, Silver, Gold and Platinum.

10. Months_on_book: The length of the customer's relationship with the bank, in
months. Data type is numerical; we have chosen it as the output variable in
most of our analysis.

11. Total_Relationship_Count: The total number of products or services the
customer holds with the bank. Data type is numerical.

12. Months_Inactive_12_mon: The number of months in the last 12 months during
which the customer did not use his/her credit card. Data type is numerical.

13. Contacts_Count_12_mon: The number of contacts between the customer and the
bank in the last 12 months. Data type is numerical.

14. Credit_Limit: The credit limit of the credit card. Data type is numerical.

15. Total_Revolving_Bal: The total revolving balance of the card. Data type is
numerical.

16. Avg_Open_To_Buy: The average open-to-buy amount (the credit still available
on the card) over the last 12 months. Data type is numerical.

17. Total_Amt_Chng_Q4_Q1: The change in transaction amount in quarter 4 over
quarter 1. Data type is numerical.

18. Total_Trans_Amt: The total transaction amount in the last 12 months. Data
type is numerical.

19. Total_Trans_Ct: The total number of transactions in the last 12 months.
Data type is numerical.

20. Total_Ct_Chng_Q4_Q1: The change in transaction count in quarter 4 over
quarter 1. Data type is numerical.

21. Avg_Utilization_Ratio: The average card utilisation ratio. Data type is
numerical.

22. Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1:
Output of a simple probabilistic classifier (Naive Bayes classifier) with
strong (naive) independence assumptions between the mentioned variables.

23. Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2:
Output of a simple probabilistic classifier (Naive Bayes classifier) with
strong (naive) independence assumptions between the mentioned variables.

Summary of the data set using R
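
The summary itself is not reproduced here, so the following is only a minimal sketch of how such a summary could be produced. It assumes a small synthetic data frame, churn_demo (a hypothetical name), with made-up values; only the column names follow the variable list above.

```r
# Synthetic stand-in for the real data set: all values are made up,
# only the column names follow the variable list above.
set.seed(1)
n <- 200
churn_demo <- data.frame(
  CLIENTNUM      = seq_len(n),
  Attrition_Flag = sample(c("Attrited Customer", "Existing Customer"), n, replace = TRUE),
  Gender         = sample(c("M", "F"), n, replace = TRUE),
  Card_Category  = sample(c("Blue", "Silver", "Gold", "Platinum"), n, replace = TRUE),
  Customer_Age   = sample(26:73, n, replace = TRUE),
  Months_on_book = sample(13:56, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Drop the identifier, which carries no analytical information.
churn_demo$CLIENTNUM <- NULL

# Store categorical columns as factors so summary() reports level counts.
cat_cols <- c("Attrition_Flag", "Gender", "Card_Category")
churn_demo[cat_cols] <- lapply(churn_demo[cat_cols], factor)

str(churn_demo)      # column types
summary(churn_demo)  # numeric summaries and counts per factor level
```

Converting the categorical columns to factors first is a design choice: summary() then prints counts per category instead of only the length and class of a character column.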

Statistical Analysis

Chi-square Test:

1) Gender (Male or Female) and Card Category are independent variables

Null Hypothesis: “Gender” and “Card Category” are independent of each other
Alternate Hypothesis: “Gender” and “Card Category” are dependent on each other

Code:
# Contingency table of Gender by Card_Category, then the chi-square test
tab1 <- table(credit_card_churn_updated$Gender,
              credit_card_churn_updated$Card_Category)
chisq.test(tab1)

Output:

As p-value < 0.05, we reject the null hypothesis.
“Gender” and “Card Category” are dependent on each other.
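
The decision rule used above can be sketched in code. This is an illustrative example on synthetic data (the demo data frame and its values are made up, not taken from the report); it shows how the statistic, degrees of freedom and p-value can be read from the object that chisq.test() returns.

```r
# Synthetic data: values are made up for illustration only.
set.seed(42)
demo <- data.frame(
  Gender        = sample(c("F", "M"), 500, replace = TRUE),
  Card_Category = sample(c("Blue", "Silver", "Gold"), 500,
                         replace = TRUE, prob = c(0.7, 0.2, 0.1))
)

tab <- table(demo$Gender, demo$Card_Category)
res <- chisq.test(tab)

res$statistic  # chi-square statistic
res$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value    # p-value used in the decision

# Decision at the 5% significance level.
alpha <- 0.05
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"
print(decision)
```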

2) Education Level and Income Category are independent variables

Null Hypothesis: “Education Level” and “Income Category” are independent of
each other
Alternate Hypothesis: “Education Level” and “Income Category” are dependent on
each other

Code:
# Contingency table of Education_Level by Income_Category, then the test
tab2 <- table(credit_card_churn_updated$Education_Level,
              credit_card_churn_updated$Income_Category)
chisq.test(tab2)

Output:
As p-value < 0.05, we reject the null hypothesis.
“Education Level” and “Income Category” are dependent on each other.
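
With many categories on both sides, some cells of the table can be sparse, and the chi-square approximation is commonly considered unreliable when expected counts fall below about 5. The sketch below, on synthetic data with made-up values, shows one way to check the expected counts and, if needed, fall back on a Monte Carlo p-value. Whether this applies to the real table is an assumption that would have to be checked on the actual data.

```r
# Synthetic data with deliberately rare categories (values are made up).
set.seed(7)
edu <- sample(c("Graduate", "High School", "Doctorate"), 300,
              replace = TRUE, prob = c(0.60, 0.37, 0.03))
inc <- sample(c("< $40K", "$40K - $60K", "> $120K"), 300,
              replace = TRUE, prob = c(0.60, 0.35, 0.05))

tab <- table(edu, inc)

# chisq.test() warns when the approximation may be poor; the warning is
# suppressed here because the expected counts are inspected explicitly.
res <- suppressWarnings(chisq.test(tab))
print(res$expected)

# Number of cells whose expected count is below 5.
small <- sum(res$expected < 5)
print(small)

if (small > 0) {
  # Monte Carlo p-value based on B simulated tables.
  res_mc <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
  print(res_mc$p.value)
}
```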

3) Income Category and Card Category are independent variables

Null Hypothesis: “Income Category” and “Card Category” are independent
variables
Alternate Hypothesis: “Income Category” and “Card Category” are not
independent variables

Code:
# Contingency table of Income_Category by Card_Category, then the test
tab4 <- table(credit_card_churn_updated$Income_Category,
              credit_card_churn_updated$Card_Category)
chisq.test(tab4)

Output:

As p-value < 0.05, we reject the null hypothesis.
“Income Category” and “Card Category” are dependent on each other.

t-test:

1) Mean of “Months_on_book” for Attrited Customers is equal to the mean of
“Months_on_book” for Existing Customers

Null Hypothesis: Mean of “Months_on_book” for Attrited Customers is equal to
the mean of “Months_on_book” for Existing Customers.
Alternate Hypothesis: Mean of “Months_on_book” for Attrited Customers is not
equal to the mean of “Months_on_book” for Existing Customers.

Code:
# Two-sample t-test with pooled variance (var.equal = TRUE)
res <- t.test(Months_on_book ~ Attrition_Flag,
              data = credit_card_churn_updated, var.equal = TRUE)
res

Output:

As p-value > 0.05, we fail to reject the null hypothesis.
There is no significant difference between the mean of “Months_on_book” for
Attrited Customers and the mean for Existing Customers.
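
The tests in this section use two variants of the t-test: some set var.equal = TRUE (pooled variance) and some leave the default, which is Welch's test. The sketch below compares the two on synthetic data (the demo data frame and its values are made up) and shows where the group means and p-values can be read.

```r
# Synthetic data with unequal group sizes (values are made up).
set.seed(3)
demo <- data.frame(
  Attrition_Flag = rep(c("Attrited Customer", "Existing Customer"),
                       times = c(40, 160)),
  Months_on_book = round(rnorm(200, mean = 36, sd = 8))
)

# Pooled-variance t-test: assumes both groups have the same variance.
pooled <- t.test(Months_on_book ~ Attrition_Flag, data = demo, var.equal = TRUE)

# Welch t-test (R's default): does not assume equal variances.
welch <- t.test(Months_on_book ~ Attrition_Flag, data = demo)

pooled$estimate                                     # the two group means
c(pooled = pooled$p.value, welch = welch$p.value)   # p-values side by side
c(pooled = unname(pooled$parameter),
  welch  = unname(welch$parameter))                 # degrees of freedom
```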

2) Mean of “Months_on_book” for males is equal to the mean of
“Months_on_book” for females

Null Hypothesis: Mean of “Months_on_book” for males is equal to the mean of
“Months_on_book” for females.
Alternate Hypothesis: Mean of “Months_on_book” for males is not equal to the
mean of “Months_on_book” for females.

Code:
# Two-sample t-test with pooled variance (var.equal = TRUE)
res1 <- t.test(Months_on_book ~ Gender,
               data = credit_card_churn_updated, var.equal = TRUE)
res1

Output:

As p-value > 0.05, we fail to reject the null hypothesis.
There is no significant difference between the mean of “Months_on_book” for
males and the mean for females.

3) Mean of “Total_Relationship_Count” for males is equal to the mean of
“Total_Relationship_Count” for females

Null Hypothesis: Mean of “Total_Relationship_Count” for males is equal to the
mean of “Total_Relationship_Count” for females.
Alternate Hypothesis: Mean of “Total_Relationship_Count” for males is not
equal to the mean of “Total_Relationship_Count” for females.

Code:
# Two-sample t-test with pooled variance (var.equal = TRUE)
res2 <- t.test(Total_Relationship_Count ~ Gender,
               data = credit_card_churn_updated, var.equal = TRUE)
res2

Output:

As p-value > 0.05, we fail to reject the null hypothesis.
There is no significant difference between the mean of
“Total_Relationship_Count” for males and the mean for females.

4) Mean of “Credit_Limit” for males is equal to the mean of “Credit_Limit”
for females

Null Hypothesis: Mean of “Credit_Limit” for males is equal to the mean of
“Credit_Limit” for females.
Alternate Hypothesis: Mean of “Credit_Limit” for males is not equal to the
mean of “Credit_Limit” for females.

Code:
# Welch two-sample t-test (default, variances not assumed equal)
res3 <- t.test(Credit_Limit ~ Gender, data = credit_card_churn_updated)
res3

Output:

As p-value < 0.05, we reject the null hypothesis.
Mean of “Credit_Limit” for males is not equal to the mean of “Credit_Limit”
for females.

5) Mean of “Total_Trans_Amt” for Attrited Customers is equal to the mean of
“Total_Trans_Amt” for Existing Customers

Null Hypothesis: Mean of “Total_Trans_Amt” for Attrited Customers is equal to
the mean of “Total_Trans_Amt” for Existing Customers.
Alternate Hypothesis: Mean of “Total_Trans_Amt” for Attrited Customers is not
equal to the mean of “Total_Trans_Amt” for Existing Customers.

Code:
# Welch two-sample t-test (default, variances not assumed equal)
res4 <- t.test(Total_Trans_Amt ~ Attrition_Flag,
               data = credit_card_churn_updated)
res4

Output:

As p-value < 0.05, we reject the null hypothesis.
Mean of “Total_Trans_Amt” for Attrited Customers is not equal to the mean of
“Total_Trans_Amt” for Existing Customers.

One-way ANOVA:

1) Mean of “Months_on_book” is the same for all card categories

Null Hypothesis: Mean of “Months_on_book” is the same for all card categories.
Alternate Hypothesis: At least one card category has a different mean of
“Months_on_book”.

Code:
# One-way ANOVA of Months_on_book across card categories
one.way <- aov(Months_on_book ~ Card_Category,
               data = credit_card_churn_updated)
summary(one.way)

Output:

As p-value > 0.05, we fail to reject the null hypothesis.
There is no significant difference in the mean of “Months_on_book” across
card categories.

2) Mean of “Months_Inactive_12_mon” is the same for all card categories

Null Hypothesis: Mean of “Months_Inactive_12_mon” is the same for all card
categories.
Alternate Hypothesis: At least one card category has a different mean of
“Months_Inactive_12_mon”.

Code:
# One-way ANOVA of Months_Inactive_12_mon across card categories
one.way1 <- aov(Months_Inactive_12_mon ~ Card_Category,
                data = credit_card_churn_updated)
summary(one.way1)

Output:

As p-value > 0.05, we cannot reject the null hypothesis.
There is no significant difference in the mean of “Months_Inactive_12_mon”
across card categories.

3) Mean of “Months_on_book” is the same for all education levels

Null Hypothesis: Mean of “Months_on_book” is the same for all education
levels.
Alternate Hypothesis: At least one education level has a different mean of
“Months_on_book”.

Code:
# One-way ANOVA of Months_on_book across education levels
one.way2 <- aov(Months_on_book ~ Education_Level,
                data = credit_card_churn_updated)
summary(one.way2)

Output:

As p-value > 0.05, we fail to reject the null hypothesis.
There is no significant difference in the mean of “Months_on_book” across
education levels.

4) Mean of “Credit_Limit” is the same for all income categories

Null Hypothesis: Mean of “Credit_Limit” is the same for all income
categories.
Alternate Hypothesis: At least one income category has a different mean of
“Credit_Limit”.

Code:
# One-way ANOVA of Credit_Limit across income categories
one.way3 <- aov(Credit_Limit ~ Income_Category,
                data = credit_card_churn_updated)
summary(one.way3)

Output:

As p-value < 0.05, we reject the null hypothesis.
Mean of “Credit_Limit” differs between income categories.
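
A significant one-way ANOVA only says that at least one group mean differs; it does not say which ones. A common follow-up, not part of the analysis above, is Tukey's HSD post hoc test. The sketch below runs it on synthetic data (the demo data frame, its three income slots and all values are made up for illustration).

```r
# Synthetic data: three income slots with clearly different mean limits.
set.seed(11)
income_levels <- c("< $40K", "$40K - $60K", "$60K - $80K")
demo <- data.frame(
  Income_Category = factor(rep(income_levels, each = 50), levels = income_levels),
  Credit_Limit    = c(rnorm(50, 4000, 1500),
                      rnorm(50, 6000, 1500),
                      rnorm(50, 11000, 1500))
)

fit <- aov(Credit_Limit ~ Income_Category, data = demo)

# Overall p-value of the F-test, read from the ANOVA table.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p_value)

# Pairwise comparisons: difference in means, confidence interval and
# adjusted p-value for every pair of income categories.
posthoc <- TukeyHSD(fit)
print(posthoc$Income_Category)
```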

5) Mean of “Total_Trans_Amt” is the same for all income categories

Null Hypothesis: Mean of “Total_Trans_Amt” is the same for all income
categories.
Alternate Hypothesis: At least one income category has a different mean of
“Total_Trans_Amt”.

Code:
# One-way ANOVA of Total_Trans_Amt across income categories
one.way4 <- aov(Total_Trans_Amt ~ Income_Category,
                data = credit_card_churn_updated)
summary(one.way4)

Output:

As p-value > 0.05, we cannot reject the null hypothesis.
There is no significant difference in the mean of “Total_Trans_Amt” across
income categories.

Correlation:

1) There is a positive correlation between “Customer_Age” and
“Months_on_book”

Code:
# Pearson correlation test between age and length of relationship
cor.test(credit_card_churn_updated$Customer_Age,
         credit_card_churn_updated$Months_on_book)

Output:

There is a strong positive correlation between “Customer_Age” and
“Months_on_book”.
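
To make a statement such as "strong positive correlation" reproducible, the coefficient can be read from the cor.test() result and labelled with explicit cut-offs. The sketch below uses synthetic data (made-up values), and the 0.3 and 0.7 thresholds are an assumed convention, not something taken from the report.

```r
# Synthetic data: months on book built to track age (values are made up).
set.seed(5)
age    <- round(runif(300, 26, 70))
months <- round(age - 10 + rnorm(300, sd = 4))

ct <- cor.test(age, months)  # Pearson correlation by default

r <- unname(ct$estimate)     # correlation coefficient
print(r)
print(ct$conf.int)           # 95% confidence interval for the coefficient
print(ct$p.value)            # p-value for H0: correlation equals 0

# Verbal label for |r|; the 0.3 and 0.7 cut-offs are an assumed convention.
strength <- cut(abs(r), breaks = c(0, 0.3, 0.7, 1),
                labels = c("weak", "moderate", "strong"),
                include.lowest = TRUE)
print(as.character(strength))
```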

2) There is a positive correlation between “Months_Inactive_12_mon” and
“Credit_Limit”

Code:
# Pearson correlation test between inactivity and credit limit
cor.test(credit_card_churn_updated$Months_Inactive_12_mon,
         credit_card_churn_updated$Credit_Limit)

Output:

There is a very weak negative correlation between “Months_Inactive_12_mon”
and “Credit_Limit”.

3) There is a positive correlation between “Credit_Limit” and
“Months_on_book”

Code:
# Pearson correlation test between credit limit and length of relationship
cor.test(credit_card_churn_updated$Credit_Limit,
         credit_card_churn_updated$Months_on_book)

Output:

There is no correlation between “Credit_Limit” and “Months_on_book”.

Regression:
Assumptions: The following assumptions are made while building the models.
1. The relationship between the predictors and the response is linear.
2. The error terms have constant variance.
3. The error terms are independent of each other.
4. The error terms are normally distributed.
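
The assumptions above are stated but not checked in this report. As a minimal sketch, assuming a synthetic data frame with made-up values, the residuals of a fitted model can be inspected as follows. Independence (assumption 3) depends mainly on how the data were collected and is not tested here.

```r
# Synthetic data and a simple linear model (values are made up).
set.seed(9)
demo <- data.frame(Customer_Age = round(runif(200, 26, 70)))
demo$Months_on_book <- demo$Customer_Age - 10 + rnorm(200, sd = 4)

fit <- lm(Months_on_book ~ Customer_Age, data = demo)
res <- residuals(fit)

# Assumptions 1 and 2: residuals vs fitted values should show no curve
# and a roughly constant vertical spread.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Assumption 4: points close to the line in a normal Q-Q plot.
qqnorm(res)
qqline(res)

# Shapiro-Wilk normality test; shapiro.test() accepts between 3 and 5000
# observations, so a larger data set would need to be sampled first.
sw <- shapiro.test(res)
print(sw)
```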

Models:

1) “Months_Inactive_12_mon” is determined by “Income_Category” and
“Total_Relationship_Count”

Code:
# Linear model of inactivity on income category and relationship count
reg2 <- lm(Months_Inactive_12_mon ~ Income_Category + Total_Relationship_Count,
           data = credit_card_churn_updated)
summary(reg2)

Output:
R-squared is close to 0 and p-value > 0.05.

This model is not a good fit for describing the relationship between the
independent and dependent variables.

2) “Months_on_book” can be predicted by “Income_Category”, “Card_Category”
and “Credit_Limit”

Code:
# Linear model of Months_on_book on income, card category and credit limit
reg3 <- lm(Months_on_book ~ Income_Category + Card_Category + Credit_Limit,
           data = credit_card_churn_updated)
summary(reg3)

Output:

R-squared is close to 0, so this is not a good-fit model even though the
p-value is less than 0.05.

3) “Months_on_book” is determined by “Customer_Age”, “Gender”,
“Income_Category”, “Card_Category”, “Months_Inactive_12_mon”, “Credit_Limit”
and “Total_Trans_Amt”

Code:
# Linear model of Months_on_book on demographic and usage variables
reg1 <- lm(Months_on_book ~ Customer_Age + Gender + Income_Category +
             Card_Category + Months_Inactive_12_mon + Credit_Limit +
             Total_Trans_Amt,
           data = credit_card_churn_updated)
summary(reg1)

Output:
R-squared is between 0 and 1.
p-value < 0.05
Of the three models, this one gives the best fit.
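
The comparison between models can be made explicit by collecting R-squared, adjusted R-squared and the overall F-test p-value side by side. The sketch below does this for two models on synthetic data; the data, the model_stats helper and the two model formulas are hypothetical and only illustrate the idea.

```r
# Synthetic data: Months_on_book depends on age but not on credit limit.
set.seed(21)
n <- 300
demo <- data.frame(
  Customer_Age = round(runif(n, 26, 70)),
  Credit_Limit = runif(n, 1500, 30000)
)
demo$Months_on_book <- demo$Customer_Age - 10 + rnorm(n, sd = 4)

fit_small <- lm(Months_on_book ~ Credit_Limit, data = demo)
fit_large <- lm(Months_on_book ~ Customer_Age + Credit_Limit, data = demo)

# Hypothetical helper: fit statistics taken from summary.lm().
model_stats <- function(fit) {
  s <- summary(fit)
  f <- s$fstatistic  # named vector: value, numdf, dendf
  c(r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    f_p_value     = unname(pf(f["value"], f["numdf"], f["dendf"],
                              lower.tail = FALSE)))
}

cmp <- rbind(small = model_stats(fit_small), large = model_stats(fit_large))
print(cmp)
```

Adjusted R-squared is the fairer figure when models have different numbers of predictors, because plain R-squared never decreases when a predictor is added.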
