Question 1
A company manager says that the average balance on their credit cards is $500. Do you think that this
assertion is justified? Use a one-sample t-test to draw your conclusion.
Answer
Approach: To answer this question, we use the two-sample t-test assuming unequal variances as a stand-in for a one-sample test.
The data has only one variable to test, so we create a dummy variable of zeros and compare the credit card balance data with it.
• The most important setting is the hypothesized mean difference, which is 500 here: the dummy variable has a mean of zero, so testing for a difference of 500 between balance and the dummy is the same as testing whether the mean balance equals 500.
Analysis
The P-value is greater than the significance level of 0.05, so we cannot reject the null hypothesis.
The absolute value of the t stat is also smaller than the t critical value, so again we cannot reject the null hypothesis.
Conclusion
There is not enough evidence to conclude that the average balance on the credit cards differs from $500.
The data is therefore consistent with the manager's assertion, although failing to reject the null hypothesis does not prove that the average is exactly $500.
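As a rough cross-check, the one-sample t statistic can be recomputed from totals that appear elsewhere in this report: the group counts and sums in the Question 6 ANOVA summary and the Total SS in the regression ANOVA tables. This is only a sketch under the assumption that those tables describe the same 400 balances; it is not the spreadsheet procedure described above.

```python
import math

# Figures copied from other tables in this report (assumed to cover the same data):
# group sums and counts from the Question 6 ANOVA summary,
# Total SS (df = 399) from the regression ANOVA tables.
group_sums = [52569, 52256, 103181]
group_counts = [99, 102, 199]
total_ss = 84339911.91

n = sum(group_counts)                      # number of cardholders
sample_mean = sum(group_sums) / n          # overall mean balance
sample_var = total_ss / (n - 1)            # sample variance of balance
std_error = math.sqrt(sample_var / n)      # standard error of the mean

mu0 = 500                                  # average claimed by the manager
t_stat = (sample_mean - mu0) / std_error   # one-sample t statistic, df = n - 1

print(f"mean = {sample_mean:.3f}, t = {t_stat:.4f}, df = {n - 1}")
```

The resulting t statistic is below 1, well short of the two-tail critical value of about 1.97 that the Question 2 table reports for a similar number of degrees of freedom, which agrees with the decision not to reject the null hypothesis.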
Question 2
Is there a difference between men and women as far as average balance is concerned? Use a two-
sample t-test to draw your conclusion.
Approach: To solve this, we first split the balance data into two groups, men and women, and then compare the two group means with a two-sample t-test assuming unequal variances.
(Null Hypothesis) H0: The mean balance of men and women is the same, M1 = F1
(Alternative Hypothesis) H1: The mean balances of men (M1) and women (F1) are not equal (M1 may be less or greater than F1)
Male M1 Female F1
Mean 509.8031088 529.5362319
Variance 213554.5652 210187.1043
Observations 193 207
Hypothesized Mean Difference 0
df 396
t Stat -0.42838443
P(T<=t) one-tail 0.334302083
t Critical one-tail 1.648710601
P(T<=t) two-tail 0.668604165
t Critical two-tail 1.965972608
Analysis
The two-tail P-value is greater than the 0.05 significance level: 0.6686 > 0.05. The absolute value of the t stat (0.4284) is also smaller than the two-tail t critical value (1.9660).
Conclusion
As the P-value is greater than 0.05, we cannot reject the null hypothesis. We do not have
enough evidence that the mean credit card balance of men and women is
significantly different.
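The t stat and df in the table above can be reproduced from the reported means, variances and observation counts. The sketch below uses the standard unequal-variance (Welch) formulas; the function name `welch_t` is made up for this illustration.

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    a = var1 / n1                      # squared standard error, group 1
    b = var2 / n2                      # squared standard error, group 2
    t = (mean1 - mean2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

# Mean, variance and observations from the table above (male, then female).
t_stat, df = welch_t(509.8031088, 213554.5652, 193,
                     529.5362319, 210187.1043, 207)
print(f"t = {t_stat:.4f}, df = {df:.1f}")
```

The computed degrees of freedom are fractional; rounding them to the nearest whole number gives the df shown in the table.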
Question 3
Is there a difference between students and non-students as far as average balance is concerned? Use a
two-sample t-test to draw your conclusion.
Approach: To solve this, we again take data from the sheet, splitting the credit card balances into students and non-students.
(Null Hypothesis) H0: The mean balance of students and non-students is the same, M1 = M2
(Alternative Hypothesis) H1: The mean balances of students (M1) and non-students (M2) are not equal (M1 may be less or greater than M2)
Conclusion
From the two-sample t-test, we found that the P-value is smaller than the 0.05 significance
level. We therefore reject the null hypothesis of no difference: there is enough evidence that
the mean credit card balance differs between students and non-students.
Question 4
It is generally assumed that if there are more credit cards then the balance on the cards will be more.
Based on this dataset, do you think this is true? Calculate a correlation coefficient and show a scatter
plot to support your answer.
This is a regression-style question in which one variable is assumed to affect the other. Here the number of cards is
assumed to affect the balance, so the number of credit cards is the variable X and the credit card balance is the
variable Y, the response. We calculate the correlation coefficient, which measures the strength of the relationship
between the two.
Correlation Matrix
No of cards Balance
No of cards 1
Balance 0.086456347 1
The correlation coefficient between the number of cards per person and the balance on their cards is 0.0864.
Scatter plot
[Figure: Balance (Y) plotted against number of cards (X); trend line y = 28.987x + 434.29]
Inference
1. As the correlation is close to 0, there is only a weak relationship between the number of cards and the balance.
2. The points in the graph are widely scattered around the linear trend line, which again shows a weak relationship.
Therefore, the data gives only weak support to the assumption that the balance will be higher if there are
more credit cards: the relationship is positive but very weak.
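For reference, the correlation coefficient used above can be computed as in the following minimal sketch. The `pearson_r` helper and the sample values are hypothetical and are not the report's dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only, not the report's data.
cards = [1, 2, 3, 4, 5, 6]
balance = [300, 900, 150, 700, 400, 650]
r = pearson_r(cards, balance)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

With the report's r of 0.0864, r squared is about 0.0075, so in a simple linear fit the number of cards accounts for less than 1% of the variation in balance.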
Question 5
Examine whether the following demographic variables influence balance: (a) age, (b) years of
education, (c) marital status. For age and years of education, use scatter plots to depict their
relationship with balance and calculate the correlation coefficient. For the relationship between
marital status and balance, use a two-sample t-test to draw your conclusion
A. Let’s take the first demographic variable which is age and let’s see the relationship between age and
the balance of the credit cards.
Correlation Matrix
Age Balance
Age 1
Balance 0.001835119 1
According to the correlation matrix, the relationship between age and credit card balance is not strong:
the correlation coefficient is 0.0018, which is practically zero.
Let's plot the data in a scatter plot and look at the relationship again through the regression
equation.
[Figure: scatter plot of Balance (Y) against age (X)]
Conclusion
The trend-line equation gives a coefficient of 0.04, which means that a one-unit change in age
(one year) changes the balance by about 0.04 dollars. This is negligible, so age is
not a good predictor of Y (balance).
B. Let's see the effect of years of education on the variable Y, which is the credit card balance,
starting with the correlation matrix.
Correlation Matrix
According to the correlation matrix, the relationship between years of education and credit card balance is
negative and very weak: the correlation coefficient is -0.080.
Now let's create a scatter plot and see the equation relating the two.
Scatter Plot
[Figure: Balance (Y) plotted against years of education (X); trend line y = -1.186x + 535.97]
Inference
The equation derived from the scatter plot (y = -1.186x + 535.97) has a small negative slope: each
additional year of education is associated with a balance about $1.19 lower. Together with a correlation
close to zero, this shows that there is no meaningful relationship between years of education and balance.
C. Let's see how credit card balance is affected by marital status.
To check whether the mean credit card balance differs between married and non-married cardholders, we use a
two-sample t-test assuming unequal variances.
Conclusion
The P-value is greater than 0.05, so we fail to reject the null hypothesis. We do not have
significant evidence that the mean balances of the two groups differ, so the data does not show
that marital status affects the card balance.
Question 6
Does the ethnicity of the cardholder matter as far as balance is concerned?
Approach: To analyze this, we test whether the mean balances of all the ethnicities present in the data are
equal.
(Null Hypothesis) H0: The average credit card balance is equal across all 3 ethnicities, E1 = E2 = E3
(Alternative Hypothesis) H1: The average credit card balance is not equal across all 3 ethnicities (at least one mean differs from the others)
To check the mean differences across the three ethnicities, we use ANOVA (analysis of variance).
SUMMARY
Groups Count Sum Average Variance
African American 99 52569 531 235839.1633
Asian 102 52256 512.3137255 231748.3362
Caucasian 199 103181 518.4974874 190922.4129
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 18454.20047 2 9227.100236 0.043442783 0.957492 3.018452
Within Groups 84321457.71 397 212396.6189
Conclusion
The P-value (0.9575) is greater than the 0.05 significance level, and F (0.0434) is smaller than F crit
(3.0185), so we cannot reject the null hypothesis. We do not have significant evidence that the mean
credit card balance differs between the three ethnicities, and so no evidence that ethnicity affects
the balance.
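The SS and F values in the ANOVA table can be rebuilt from the SUMMARY table alone, using the usual one-way ANOVA decomposition into between-group and within-group sums of squares. The sketch below does that arithmetic; the variable names are chosen for this illustration.

```python
# (count, sum, variance) for each group, copied from the SUMMARY table above.
groups = {
    "African American": (99, 52569, 235839.1633),
    "Asian": (102, 52256, 231748.3362),
    "Caucasian": (199, 103181, 190922.4129),
}

n_total = sum(count for count, _, _ in groups.values())
grand_mean = sum(total for _, total, _ in groups.values()) / n_total

# Between-group SS: weighted squared distance of each group mean from the grand mean.
ss_between = sum(count * (total / count - grand_mean) ** 2
                 for count, total, _ in groups.values())
# Within-group SS: each group's variance times its degrees of freedom.
ss_within = sum((count - 1) * var for count, _, var in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"SSB = {ss_between:.2f}, SSW = {ss_within:.2f}, F = {f_stat:.6f}")
```

An F value this far below 1 means the group means differ by less than the within-group variation alone would lead us to expect.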
Question 7
A general principle that credit card companies often follow is to assign a higher credit limit to people
with a higher credit rating. Does the data show that this principle is being followed?
Approach: This is a regression question, where the variable X is the one that predicts or affects the value of
the variable Y. In this case, the variable X is the credit rating and the variable Y is the credit limit. To
check their relationship we will use a correlation matrix.
Rating Limit
Rating 1
Limit 0.99688 1
The correlation between rating and limit is 0.997, which is a very strong positive relationship, so rating is a very good predictor of Y (limit).
[Figure: scatter plot of Limit (Y) against Rating (X); trend line y = 14.872x - 542.93]
Inference
As per the equation, the credit limit increases by about $14.87 for every 1-unit increase in
rating, so there is a clear relationship between credit limit and rating. The upward linear trend
also shows the positive correlation. The data therefore shows that the principle is being followed.
Question 8
Run a simple linear regression of balance on the credit limit. (Here credit limit is the X and the balance
is the Y). Report the coefficients and the R-squared. Show a scatter plot.
We will do a simple linear regression analysis to see how much X (CARD LIMIT) predicts the Y (BALANCE)
Regression model
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8616973
R Square 0.7425222
Adjusted R Square 0.7418753
Standard Error 233.585
Observations 400
ANOVA
df SS MS F Significance F
Regression 1 62624255 62624255 1147.7642 2.53E-119
Residual 398 21715657 54561.951
Total 399 84339912
From the above regression model, we conclude that the P-value (Significance F = 2.53E-119) is less than 0.05,
which means the card limit is a significant predictor of card balance.
- The coefficient of limit, 0.17, means that balance increases by $0.17 for every $1
increase in limit
- The R-square indicates that about 74% of the variation in credit card balance is explained
by the credit card limit
Let us draw a scatter plot to see the regression relationship.
[Figure: scatter plot of balance (Y) against credit limit (X)]
Inference
The regression model's P-value is less than 0.05, which shows that card limit is a strong predictor
of credit card balance, and the scatter plot shows a positive, increasing relationship
between the two. If the limit goes up by 1 unit, the balance goes up by 0.1716.
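The R Square, Adjusted R Square and F values in the Question 8 output all follow from the sums of squares in its ANOVA table. The sketch below reproduces them with the standard formulas; the variable names are chosen for this illustration.

```python
# Sums of squares and sample size from the Question 8 ANOVA table.
ss_regression = 62624255
ss_residual = 21715657
n_obs = 400
n_predictors = 1

ss_total = ss_regression + ss_residual

# Share of the variation in balance explained by the model.
r_squared = ss_regression / ss_total

# Adjusted R squared penalises the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# F is the ratio of the regression and residual mean squares.
ms_regression = ss_regression / n_predictors
ms_residual = ss_residual / (n_obs - n_predictors - 1)
f_stat = ms_regression / ms_residual

print(f"R2 = {r_squared:.7f}, adj R2 = {adj_r_squared:.7f}, F = {f_stat:.4f}")
```

The same arithmetic applies to the other regression outputs in this report once `n_predictors` and the sums of squares are replaced with the values from the relevant table.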
Question 9
Run a simple linear regression of balance (Y) on credit rating (X). Report the coefficients and R-
squared. Show a scatter plot
We will do a simple linear regression analysis to see how much X (CARD RATING) predicts the Y (BALANCE)
Regression model
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.863625161
R Square 0.745848418
Adjusted R Square 0.745209846
Standard Error 232.0713048
Observations 400
ANOVA
df SS MS F Significance F
Regression 1 62904789.88 62904789.88 1167.994581 1.8989E-120
Residual 398 21435122.03 53857.09053
Total 399 84339911.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -390.8463418 29.06851463 -13.44569362 3.07318E-34 -447.993365 -333.6993186 -447.993365 -333.6993186
Rating (x) 2.566240327 0.075089102 34.1759357 1.8989E-120 2.418619483 2.713861171 2.418619483 2.713861171
From the above regression model, we conclude that the P-value is less than 0.05, which means the card
rating is a significant predictor of card balance.
- The coefficient of rating, 2.566, means that balance increases by $2.566 for every
1-point increase in rating
- The R-square indicates that about 74.6% of the variation in credit card balance is explained
by the credit card rating
Scatter Plot
[Figure: Balance (Y) plotted against credit rating (X); trend line y = 2.5662x - 390.85]
Inference
The regression model's P-value is less than 0.05, which shows that card rating is a strong predictor
of credit card balance, and the scatter plot shows a positive, increasing relationship
between the two. If the rating goes up by 1 unit, the balance goes up by 2.566.
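The t Stat reported for Rating is simply the coefficient divided by its standard error, and in a regression with a single predictor its square equals the F statistic. The sketch below checks both against the Question 9 output; the variable names are chosen for this illustration.

```python
# Coefficient and standard error for Rating, from the Question 9 output.
slope = 2.566240327
slope_se = 0.075089102

# Regression and residual mean squares from the same ANOVA table.
ms_regression = 62904789.88
ms_residual = 53857.09053

t_stat = slope / slope_se              # t statistic for the slope
f_stat = ms_regression / ms_residual   # overall F statistic

print(f"t = {t_stat:.4f}, t squared = {t_stat ** 2:.3f}, F = {f_stat:.3f}")
```

This is why the P-value of the Rating coefficient and the Significance F in the ANOVA table are the same number.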
Question 10
Consider your findings in questions 8-9. Discuss business mechanisms to increase or decrease the
balance on credit cards. Try to quantify your answers. In this context, focus on possible specific
strategies using variables in Q8 and Q9 that the business could adopt to increase the balance on credit
cards
Based on Question 8 (credit limit)
Interpretation of the slope (coefficient of the limit) - The coefficient of credit limit is positive; therefore, with a $1 increase in credit card limit, the average credit card balance increases by $0.1716.
Interpretation of the intercept - When the credit card limit is equal to zero, the model gives a negative balance on the credit card of $292.79. This may not make much sense from the business perspective, since a balance does not apply to a zero-credit-limit case.
R² interpretation - About 74.25% of the variation in credit card balance is explained by the credit card limit; the remaining 25.75% is due to other factors.
Based on Question 9 (credit rating)
The P-value (1.8989E-120) is less than 0.05; therefore credit rating is a significant predictor of balance.
Coefficients - y = 2.5662x - 390.85, where y = balance on the credit card and x = credit rating
Business implication of credit rating - The coefficient of credit rating is positive; therefore, with a 1-point increase in credit rating, the average balance on the credit card increases by $2.5662.
Interpretation/business implication of the intercept - When credit rating is equal to zero, the model gives a negative balance on the credit card of $390.85.
ANOVA
df SS MS F Significance F
Regression 2 63132707.37 31566353.68 590.9238244 9.7585E-120
Residual 397 21207204.54 53418.65124
Total 399 84339911.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0%
Intercept -369.0359554 36.16414657 -10.20447018 7.22692E-22 -440.133128 -297.9387828 -440.133128
No. Cards X1 26.03375427 8.438363509 3.085166246 0.002176819 9.444290848 42.62321769 9.444290848
Limit X2 0.171479037 0.005013136 34.20593861 2.0023E-120 0.161623424 0.18133465 0.161623424
A. The coefficient of No. of cards (X1) is 26.034
The coefficient of limit (X2) is 0.1715
B.
Approach: The coefficients in this regression model differ from those obtained in the simple linear
regressions; for No. of cards the simple regression coefficient was 28.987.
The reason is that the model now has a second predictor alongside the first, so each coefficient measures the
effect of one variable while the other is held constant: the effect of No. of cards with limit kept constant, and the effect of limit with No. of cards kept constant.
C. If limit goes up by 1 unit, balance goes up by 0.1715 (keeping cards constant).
D. If cards go up by 1 unit, balance goes up by 26.034 (keeping limit constant).
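The "holding the other variable constant" reading of the coefficients can be illustrated with a small prediction function. This is a sketch only: the function name and the example cardholder are hypothetical, and the intercept is taken as negative on the assumption that the detached minus signs in the coefficient table belong to the intercept row.

```python
# Coefficients from the two-predictor regression output above.
# Assumption: the intercept is negative (see the note in the lead-in).
INTERCEPT = -369.0359554
COEF_CARDS = 26.03375427
COEF_LIMIT = 0.171479037

def predict_balance(cards, limit):
    """Predicted balance from the number of cards and the credit limit."""
    return INTERCEPT + COEF_CARDS * cards + COEF_LIMIT * limit

# Hypothetical cardholder: 3 cards and a limit of 5,000 (illustrative values).
base = predict_balance(3, 5000)
one_more_card = predict_balance(4, 5000)   # limit held constant
higher_limit = predict_balance(3, 5001)    # number of cards held constant

print(f"effect of one more card: {one_more_card - base:.4f}")
print(f"effect of one more unit of limit: {higher_limit - base:.4f}")
```

Changing one input while the other stays fixed moves the prediction by exactly that input's coefficient, which is the interpretation given in C and D.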
Question 12
Run a simple linear regression equation with Income as X and Balance as Y. Report the coefficients. Is
the coefficient of Income significantly different from zero? What does this say about the effect of
income on balance?
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.463656457
R Square 0.21497731
Adjusted R Square 0.213004891
Standard Error 407.8647195
Observations 400
ANOVA
df SS MS F Significance F
Regression 1 18131167.4 18131167.4 108.9917152 1.03089E-22
Residual 398 66208744.51 166353.6294
Total 399 84339911.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 246.5147506 33.19934735 7.425289058 6.90344E-13 181.2467485 311.7827527 181.2467485 311.7827527
Income X 6.048363409 0.579350163 10.43990973 1.03089E-22 4.909394402 7.187332415 4.909394402 7.187332415
(A) The coefficients of the regression model of Balance on Income are
Intercept 246.514
Income X 6.048
(B) The P-value of the Income coefficient (1.03089E-22) is far below 0.05, so the coefficient is significantly different from zero.
(C) The coefficient of income is 6.048, which says that the
balance increases by 6.048 when income increases by one unit.
Question 13
Based on the equation derived in question 12, what is the estimated balance for a person with an
income of USD 100k per year?
Let's create a scatter plot for the above question to get the regression equation.
[Figure: scatter plot of Balance (Y) against Income (X); trend line y = 6.0484x + 246.51]
Y = 6.0484(100) + 246.51
Y = 851.35
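The same estimate can be written as a small function, which makes it easy to try other income levels. This sketch assumes, as the calculation above does, that income is entered in thousands of USD; the function name is made up for this illustration.

```python
# Slope and intercept from the trend line y = 6.0484x + 246.51.
SLOPE = 6.0484
INTERCEPT = 246.51

def estimate_balance(income_thousands):
    """Estimated balance for an income given in thousands of USD."""
    return SLOPE * income_thousands + INTERCEPT

estimate = estimate_balance(100)   # income of USD 100k per year
print(f"Estimated balance: {estimate:.2f}")
```

Because R Square for this model is only about 0.21, an individual's actual balance can sit far from this estimate.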
Regression Statistics
Multiple R 0.936702578
R Square 0.87741172
Adjusted R Square 0.875856031
Standard Error 161.9917647
Observations 400
ANOVA
df SS MS F Significance F
Regression 5 74000827.17 14800165.43 564.0020686 4.5908E-177
Residual 394 10339084.74 26241.33183
Total 399 84339911.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -473.2514026 55.10833546 -8.587655545 2.08837E-16 -581.5945666 -364.9082387 -581.5945666 -364.9082387
Income X1 -7.608832003 0.381931562 -19.92197755 1.37077E-61 -8.359710677 -6.85795333 -8.359710677 -6.85795333
Limit X2 0.07901642 0.044791005 1.764113581 0.078487737 -0.009042839 0.167075679 -0.009042839 0.167075679
Rating X3 2.773843725 0.667079559 4.158190261 3.93909E-05 1.462363177 4.085324273 1.462363177 4.085324273
Age X4 -0.860030445 0.478700493 -1.796594023 0.073165937 -1.801157147 0.081096257 -1.801157147 0.081096257
Education X5 1.967791521 2.605290902 0.755305874 0.450516748 -3.154218733 7.089801776 -3.154218733 7.089801776
Presented by
ARPAN BHATIA
PGPMex ‘21