Omicron

TEAM: OMICRON
Group Assignment

TEAM MEMBERS DETAILS:
Name: Ankita Mohapatra (22020343011)
Name: Ajitesh Das (22020343006)
Name: Varun Kumar Sinha (22020343076)
Name: Sundar Bhattacharjee (22020343070)
Name: Anshul Aggarwal (22020343013)
Defining the Problem
A consumer credit card bank is facing the problem of customer attrition. The
bank wants to analyse the data to find the reasons behind this and use them to
predict which customers are likely to drop off.
Exploring the Data set
1. CLIENTNUM: It is the client number, a unique identifier for the customer
holding the account. Data type is numeric; it is of no use in our analysis.
2. Attrition_Flag: It is the customer activity variable. There are two types of
customers in our data set: Attrited customers and Existing customers. Data type
is categorical.
3. Customer_Age: It is a demographic variable which gives the customer's age.
Data type is numeric.
4. Gender: It is another demographic variable. M denotes male and F denotes
female. Data type is categorical.
5. Dependent_count: It indicates the number of dependents of the
corresponding customer in his/her family.
6. Education_Level: It is the educational qualification of the account holder. It
is categorical data where the categories are: College, Doctorate, Graduate, High
School, Post-Graduate, Uneducated and Unknown.
7. Marital_Status: It indicates the marital status of the customer. It is
categorical data where the categories are: Divorced, Married, Single and
Unknown.
8. Income_Category: It indicates the annual income of the customer in
particular slots. It is categorical data where the slots are: < $40K, $40K -
$60K, $60K - $80K, $80K - $120K, > $120K.
9. Card_Category: It denotes the type of credit card the customer uses. It is
categorical data where the types of cards are: Blue, Silver, Gold and Platinum.
10. Months_on_book: It denotes the period of relationship of the customer
with the bank in months. It is numerical data and we have chosen this as the
output variable in most of our analysis.
11. Total_Relationship_Count: It indicates the total number of products or
services acquired by the customer from the bank. Data type is numerical.
12. Months_Inactive_12_mon: It denotes the number of months the customer
was inactive in using his/her credit card in the last 12 months. Data type is
numerical.
13. Contacts_Count_12_mon: It denotes the total number of contacts between
the customer and the bank in the last 12 months. Data type is numerical.
14. Credit_Limit: It denotes the credit limit of the credit card. Data type is
numerical.
15. Total_Revolving_Bal: It denotes the total revolving balance of the card.
Data type is numerical.
16. Avg_Open_To_Buy: It denotes the open-to-buy credit line (the credit still
available on the card), averaged over the last 12 months. Data type is numerical.
17. Total_Amt_Chng_Q4_Q1: It denotes the change in transaction amount in
quarter 4 over quarter 1. Data type is numerical.
18. Total_Trans_Amt: It denotes the total transaction amount in the last 12
months. Data type is numerical.
19. Total_Trans_Ct: It denotes the total number of transactions made in
the last 12 months. Data type is numerical.
20. Total_Ct_Chng_Q4_Q1: It denotes the change in transaction count in
quarter 4 over quarter 1. Data type is numerical.
21. Avg_Utilization_Ratio: It denotes the average card utilisation ratio. Data
type is numerical.
22. Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1:
It holds the output of a simple probabilistic classifier (Naive Bayes classifier)
with strong (naive) independence assumptions between the mentioned variables.
23. Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2:
It holds the output of a simple probabilistic classifier (Naive Bayes classifier)
with strong (naive) independence assumptions between the mentioned variables.
Summary of the data set using R
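A minimal sketch of how such a summary can be produced in base R. The small data frame below is invented toy data standing in for credit_card_churn_updated (the name toy_churn and all values are assumptions made only for illustration).

```r
# Toy stand-in for credit_card_churn_updated; all values are invented.
toy_churn <- data.frame(
  CLIENTNUM      = c(1001, 1002, 1003, 1004),
  Attrition_Flag = factor(c("Existing Customer", "Attrited Customer",
                            "Existing Customer", "Existing Customer")),
  Gender         = factor(c("M", "F", "F", "M")),
  Customer_Age   = c(45, 51, 38, 60),
  Months_on_book = c(36, 40, 27, 52)
)

# CLIENTNUM is only an identifier, so drop it before summarising.
toy_churn$CLIENTNUM <- NULL

str(toy_churn)      # column types: Factor for categorical, num for numerical
summary(toy_churn)  # level counts for factors, quartiles and mean for numerics
```

For a categorical column summary() reports the count per level, and for a numerical column it reports the minimum, quartiles, mean and maximum.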
Statistical Analysis
Chi-square Test:
1) Gender (Male or Female) and Card Category are independent variables
Null Hypothesis: “Gender” and “Card Category” are independent of each other
Alternate Hypothesis: “Gender” and “Card Category” are dependent on
each other
Code:
tab1 <- table(credit_card_churn_updated$Gender,
              credit_card_churn_updated$Card_Category)
chisq.test(tab1)
Output:
As p-value<0.05, we reject the null hypothesis.
“Gender” and “Card Category” are dependent on each other.
2) Education Level and Income Category are independent variables
Null Hypothesis: “Education Level” and “Income Category” are
independent of each other
Alternate Hypothesis: “Education Level” and “Income Category” are
dependent on each other
Code:
tab2 <- table(credit_card_churn_updated$Education_Level,
              credit_card_churn_updated$Income_Category)
chisq.test(tab2)
Output:
As p-value<0.05, we reject the null hypothesis.
“Education Level” and “Income Category” are dependent on each
other.
3) Income Category and Card Category are independent variables
Null Hypothesis: “Income Category” and “Card Category” are independent
variables
Alternate Hypothesis: “Income Category” and “Card Category” are not
independent variables
Code:
tab4 <- table(credit_card_churn_updated$Income_Category,
              credit_card_churn_updated$Card_Category)
chisq.test(tab4)
Output:
As p-value<0.05, we reject the null hypothesis.
“Income Category” and “Card Category” are dependent on each
other.
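The decision rule used in the three tests above can be sketched as follows. The data is simulated (not the bank data), and the check on expected counts is an addition that assumes some categories may be rare, in which case the chi-square approximation is less reliable.

```r
# Simulated two-way table; variable names and values are illustrative only.
set.seed(1)
gender <- factor(sample(c("F", "M"), 200, replace = TRUE))
card   <- factor(sample(c("Blue", "Silver"), 200, replace = TRUE,
                        prob = c(0.8, 0.2)))

tab <- table(gender, card)
res <- chisq.test(tab)

res$p.value    # compare against the 0.05 significance level
res$expected   # expected counts under independence

# If any expected count is below 5, a simulated p-value is a common fallback.
if (any(res$expected < 5)) {
  res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
}

decision <- if (res$p.value < 0.05) {
  "reject the null hypothesis (variables are dependent)"
} else {
  "fail to reject the null hypothesis"
}
decision
```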
t-test:
1) Mean of “Months_on_book” for Attrited Customers is equal to the
mean of “Months_on_book” for Existing Customers
Null Hypothesis: Mean of “Months_on_book” for Attrited
Customers is equal to the mean of “Months_on_book” for Existing
Customers.
Alternate Hypothesis: Mean of “Months_on_book” for Attrited
Customers is not equal to the mean of “Months_on_book” for Existing
Customers.
Code:
res <- t.test(Months_on_book ~ Attrition_Flag,
              data = credit_card_churn_updated, var.equal = TRUE)
res
Output:
As p-value>0.05, we fail to reject the null hypothesis.
There is no significant difference between the mean of
“Months_on_book” for Attrited Customers and that for Existing Customers.
2) Mean of “Months_on_book” for males is equal to the mean of
“Months_on_book” for females.
Null Hypothesis: Mean of “Months_on_book” for males is equal to
the mean of “Months_on_book” for females.
Alternate Hypothesis: Mean of “Months_on_book” for males is not
equal to the mean of “Months_on_book” for females.
Code:
res1 <- t.test(Months_on_book ~ Gender,
               data = credit_card_churn_updated, var.equal = TRUE)
res1
Output:
As p-value>0.05, we fail to reject the null hypothesis.
There is no significant difference between the mean of
“Months_on_book” for males and that for females.
3) Mean of “Total_Relationship_Count” for males is equal to the
mean of “Total_Relationship_Count” for females.
Null Hypothesis: Mean of “Total_Relationship_Count” for males is
equal to the mean of “Total_Relationship_Count” for females.
Alternate Hypothesis: Mean of “Total_Relationship_Count” for males
is not equal to the mean of “Total_Relationship_Count” for females.
Code:
res2 <- t.test(Total_Relationship_Count ~ Gender,
               data = credit_card_churn_updated, var.equal = TRUE)
res2
Output:
As p-value>0.05, we fail to reject the null hypothesis.
There is no significant difference between the mean of
“Total_Relationship_Count” for males and that for females.
4) Mean of “Credit_Limit” for males is equal to the mean of
“Credit_Limit” for females.
Null Hypothesis: Mean of “Credit_Limit” for males is equal to the
mean of “Credit_Limit” for females.
Alternate Hypothesis: Mean of “Credit_Limit” for males is not equal
to the mean of “Credit_Limit” for females.
Code:
res3 <- t.test(Credit_Limit ~ Gender, data = credit_card_churn_updated)
res3
Output:
As p-value<0.05, we reject the null hypothesis.
Mean of “Credit_Limit” for males is not equal to the mean of
“Credit_Limit” for females.
5) Mean of “Total_Trans_Amt” for Attrited Customers is equal to
the mean of “Total_Trans_Amt” for Existing Customers.
Null Hypothesis: Mean of “Total_Trans_Amt” for Attrited Customers
is equal to the mean of “Total_Trans_Amt” for Existing Customers.
Alternate Hypothesis: Mean of “Total_Trans_Amt” for Attrited
Customers is not equal to the mean of “Total_Trans_Amt” for Existing
Customers.
Code:
res4 <- t.test(Total_Trans_Amt ~ Attrition_Flag,
               data = credit_card_churn_updated)
res4
Output:
As p-value<0.05, we reject the null hypothesis.
Mean of “Total_Trans_Amt” for Attrited Customers is not equal
to the mean of “Total_Trans_Amt” for Existing Customers.
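Some of the t-tests above set var.equal = TRUE (pooled-variance test), while others use the default of t.test(), which is Welch's test and does not assume equal variances. A minimal sketch on simulated data, showing how the two variants can be compared and how var.test() can help decide between them (the data frame toy and its values are assumptions for illustration):

```r
# Simulated groups with clearly different spreads; values are invented.
set.seed(42)
toy <- data.frame(
  Gender       = factor(rep(c("F", "M"), each = 50)),
  Credit_Limit = c(rnorm(50, mean = 5000, sd = 1000),
                   rnorm(50, mean = 6000, sd = 3000))
)

welch  <- t.test(Credit_Limit ~ Gender, data = toy)                    # default
pooled <- t.test(Credit_Limit ~ Gender, data = toy, var.equal = TRUE)  # pooled

# F test for equality of the two group variances.
var_p <- var.test(Credit_Limit ~ Gender, data = toy)$p.value

c(welch_p = welch$p.value, pooled_p = pooled$p.value, var_test_p = var_p)
c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter))
```

When var.test() gives a small p-value, the equal-variance assumption is doubtful and the Welch result is the safer one to report.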
One-way ANOVA:
1) Mean of “Months_on_book” for all card categories is the same
Null Hypothesis: Mean of “Months_on_book” is the same for all card
categories
Alternate Hypothesis: At least one of the means of
“Months_on_book” is not the same
Code:
one.way <- aov(Months_on_book ~ Card_Category,
               data = credit_card_churn_updated)
summary(one.way)
Output:
As p-value>0.05, we cannot reject the null hypothesis.
There is no significant difference in the mean of “Months_on_book”
across card categories.
2) Mean of “Months_Inactive_12_mon” for all card categories is the
same
Null Hypothesis: Mean of “Months_Inactive_12_mon” is the same for
all card categories.
Alternate Hypothesis: At least one of the means of
“Months_Inactive_12_mon” is not the same.
Code:
one.way1 <- aov(Months_Inactive_12_mon ~ Card_Category,
                data = credit_card_churn_updated)
summary(one.way1)
Output:
As p-value>0.05, we cannot reject the null hypothesis.
There is no significant difference in the mean of
“Months_Inactive_12_mon” across card categories.
3) Mean of “Months_on_book” is the same for all education levels
Null Hypothesis: Mean of “Months_on_book” is the same for all
education levels.
Alternate Hypothesis: Mean of “Months_on_book” is not the same
for all education levels.
Code:
one.way2 <- aov(Months_on_book ~ Education_Level,
                data = credit_card_churn_updated)
summary(one.way2)
Output:
As p-value>0.05, we cannot reject the null hypothesis.
There is no significant difference in the mean of “Months_on_book”
across education levels.
4) Mean of “Credit_Limit” is the same for all income categories
Null Hypothesis: Mean of “Credit_Limit” is the same for all income
categories.
Alternate Hypothesis: At least one of the means of “Credit_Limit”
is not the same across income categories.
Code:
one.way3 <- aov(Credit_Limit ~ Income_Category,
                data = credit_card_churn_updated)
summary(one.way3)
Output:
As p-value<0.05, we reject the null hypothesis.
Mean of “Credit_Limit” differs between income categories.
5) Mean of “Total_Trans_Amt” is the same for all income categories
Null Hypothesis: Mean of “Total_Trans_Amt” is the same for all
income categories.
Alternate Hypothesis: Mean of “Total_Trans_Amt” is not the same
for all income categories.
Code:
one.way4 <- aov(Total_Trans_Amt ~ Income_Category,
                data = credit_card_churn_updated)
summary(one.way4)
Output:
As p-value>0.05, we cannot reject the null hypothesis.
There is no significant difference in the mean of “Total_Trans_Amt”
across income categories.
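A one-way ANOVA only says whether at least one group mean differs; it does not say which groups differ. As a possible follow-up when the null hypothesis is rejected, the sketch below runs Tukey's HSD on simulated data (the data frame toy and its values are invented, and this post-hoc step is an addition, not part of the analysis above).

```r
# Simulated credit limits for three income slots; values are invented.
set.seed(7)
toy <- data.frame(
  Income_Category = factor(rep(c("Less than $40K", "$40K - $60K",
                                 "$60K - $80K"), each = 30)),
  Credit_Limit    = c(rnorm(30, 4000, 800),
                      rnorm(30, 5500, 800),
                      rnorm(30, 9000, 800))
)

fit <- aov(Credit_Limit ~ Income_Category, data = toy)

# Pull the p-value of the F test out of the ANOVA table.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# Pairwise comparisons are only of interest when the F test is significant.
if (p_value < 0.05) {
  posthoc <- TukeyHSD(fit)
  print(posthoc)   # one row per pair of income categories
}
```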
Correlation:
1) There is a positive correlation between “Customer_Age” and
“Months_on_book”
Code:
cor.test(credit_card_churn_updated$Customer_Age,
         credit_card_churn_updated$Months_on_book)
Output:
There is a strong positive correlation between
“Customer_Age” and “Months_on_book”.
2) There is a positive correlation between
“Months_Inactive_12_mon” and “Credit_Limit”
Code:
cor.test(credit_card_churn_updated$Months_Inactive_12_mon,
         credit_card_churn_updated$Credit_Limit)
Output:
There is a very weak negative correlation between
“Months_Inactive_12_mon” and “Credit_Limit”, so the hypothesis of a
positive correlation is not supported.
3) There is a positive correlation between “Credit_Limit” and
“Months_on_book”
Code:
cor.test(credit_card_churn_updated$Credit_Limit,
         credit_card_churn_updated$Months_on_book)
Output:
There is no significant correlation between “Credit_Limit” and
“Months_on_book”.
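The verbal conclusions above ("strong", "very weak", "no correlation") come from the estimate and p-value that cor.test() returns. A minimal sketch on simulated data follows; the helper describe_r and its cut-offs (0.1, 0.3, 0.7) are a hypothetical rule of thumb introduced here, not a standard taken from the analysis.

```r
# Simulated age and tenure with a built-in linear relationship; values invented.
set.seed(3)
age            <- round(runif(100, 26, 70))
months_on_book <- round(age - 10 + rnorm(100, sd = 4))

ct <- cor.test(age, months_on_book)   # Pearson correlation by default
ct$estimate   # correlation coefficient r
ct$p.value    # test of H0: true correlation is 0
ct$conf.int   # 95% confidence interval for r

# Hypothetical helper: turn |r| into a rough verbal label.
describe_r <- function(r) {
  a <- abs(r)
  if (a < 0.1) "very weak or none"
  else if (a < 0.3) "weak"
  else if (a < 0.7) "moderate"
  else "strong"
}

describe_r(unname(ct$estimate))
```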
Regression:
Assumptions: The following assumptions are made while
building the models.
1. The model is linear.
2. The error terms have constant variances.
3. The error terms are independent of each other.
4. The error terms are normally distributed.
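These assumptions can be examined from the residuals of a fitted model. The sketch below uses simulated data and base R only; the data frame toy is invented, and the particular checks chosen (residual plot, Q-Q plot, Shapiro-Wilk test) are one common option rather than the procedure used in this report.

```r
# Simulated data with a linear relationship; values are invented.
set.seed(11)
toy <- data.frame(Customer_Age = runif(200, 26, 70))
toy$Months_on_book <- 0.8 * toy$Customer_Age + rnorm(200, sd = 4)

fit <- lm(Months_on_book ~ Customer_Age, data = toy)
res <- residuals(fit)

# Assumptions 1 and 2: residuals should scatter evenly around 0
# with no curve (linearity) and no funnel shape (constant variance).
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Assumption 4: points close to the line suggest normal residuals.
qqnorm(res)
qqline(res)

# Formal normality test; a small p-value signals non-normal residuals.
shapiro_p <- shapiro.test(res)$p.value
shapiro_p
```

Assumption 3 (independence) depends mainly on how the data was collected, so it is usually judged from the study design rather than from a single plot.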
Models:
1) “Months_Inactive_12_mon” is determined by
“Income_Category” and “Total_Relationship_Count”
Code:
reg2 <- lm(Months_Inactive_12_mon ~ Income_Category +
             Total_Relationship_Count,
           data = credit_card_churn_updated)
summary(reg2)
Output:
R-square is close to 0 and p-value>0.05.
This model is not a good fit for predicting the dependent variable
from the independent variables.
2) “Months_on_book” can be predicted by
“Income_Category”, “Card_Category” and “Credit_Limit”
Code:
reg3 <- lm(Months_on_book ~ Income_Category + Card_Category +
             Credit_Limit,
           data = credit_card_churn_updated)
summary(reg3)
Output:
R-square is close to 0, hence it is not a good fit
even though the p-value is less than 0.05.
3) “Months_on_book” is determined by “Customer_Age”,
“Gender”, “Income_Category”, “Card_Category”,
“Months_Inactive_12_mon”, “Credit_Limit” and
“Total_Trans_Amt”
Code:
reg1 <- lm(Months_on_book ~ Customer_Age + Gender +
             Income_Category + Card_Category +
             Months_Inactive_12_mon + Credit_Limit +
             Total_Trans_Amt,
           data = credit_card_churn_updated)
summary(reg1)
Output:
R-square is well above 0 (between 0 and 1).
p-value<0.05
Of the three models, this one gives the best fit.
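One way to make the comparison between candidate models explicit is to put their adjusted R-square values side by side and, for nested models, run an F test with anova(). The sketch below uses simulated data; the data frame toy and the two models small and large are assumptions for illustration and do not reproduce the models above.

```r
# Simulated data in which tenure depends on age but not on credit limit.
set.seed(5)
toy <- data.frame(
  Customer_Age = runif(200, 26, 70),
  Credit_Limit = runif(200, 1500, 35000)
)
toy$Months_on_book <- 0.8 * toy$Customer_Age + rnorm(200, sd = 4)

small <- lm(Months_on_book ~ Credit_Limit, data = toy)
large <- lm(Months_on_book ~ Customer_Age + Credit_Limit, data = toy)

# Adjusted R-square penalises extra predictors, so it is fairer than
# plain R-square when models have different numbers of terms.
adj_r2 <- sapply(list(small = small, large = large),
                 function(m) summary(m)$adj.r.squared)
adj_r2

# F test: does adding Customer_Age significantly improve the fit?
anova(small, large)
```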
