CREDIT RATING REPORT INSTALLMENT 1

Project

Installment – I

Group – 8

Abhishek Singh (173)

Girish Matada (194)

Jai Kishore Jangir (198)

Paras Kohli (210)

Tushar Rane (233)

Unnati Kandelwal (234)

Installment 1

1. (a) Describe the distribution of the PRSM scores using both graphics and

descriptive statistics.

(b) Does it appear reasonable to use the Empirical Rule to describe the variation

of PRSM? Briefly explain your reasoning.

(c) If you were to remedy any evident anomalies with this variable, how well

would the Empirical Rule work now?

Ans

(a) The PRSM score is defined as PRSM = (2 * amount repaid in 6 months) / total amount to be repaid.

CODE : Descriptive Statistics

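# Mode helper (assumed definition; getmode() is not part of base R)
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# With outliers (initial_PRSM is assumed to be the data before outlier removal)
mean(initial_PRSM$Amt.Repaid.at.6.Months)
median(initial_PRSM$Amt.Repaid.at.6.Months)
getmode(initial_PRSM$Amt.Repaid.at.6.Months)

# Without outliers
mean(PRSM$Amt.Repaid.at.6.Months)
median(PRSM$Amt.Repaid.at.6.Months)
getmode(PRSM$Amt.Repaid.at.6.Months)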

OUTPUT:

GRAPHICAL Representation

With Outliers:

Without Outliers:

(b) Does it appear reasonable to use the Empirical Rule to describe the variation of

PRSM? Briefly explain your reasoning.

Ans) No. As indicated in part (a), 99.84% of the values fall within one standard

deviation of the mean, far more than the roughly 68% the Empirical Rule predicts,

so it is not advisable to use the Empirical Rule.

CODE:

mean(initial_PRSM$Amt.Repaid.at.6.Months)

sd(initial_PRSM$Amt.Repaid.at.6.Months)

mean(PRSM$Amt.Repaid.at.6.Months)

sd(PRSM$Amt.Repaid.at.6.Months)

OUTPUT:

(c) If you were to remedy any evident anomalies with this variable, how well would

the Empirical Rule work now?

Ans) After removing the outliers, the variation in PRSM is consistent with the Empirical Rule.
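
The comparison behind parts (b) and (c) can be sketched as follows. This is a minimal illustration on synthetic data, not the report's data: the helper `coverage()` and the vectors `clean` and `with_outlier` are hypothetical names introduced here. It computes the share of values within 1, 2 and 3 standard deviations of the mean so that they can be compared with the Empirical Rule benchmarks of about 68%, 95% and 99.7%.

```r
# Share of observations within k standard deviations of the mean.
coverage <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

set.seed(1)
# Synthetic stand-in for PRSM scores (NOT the report's data).
clean <- rnorm(1000, mean = 0.8, sd = 0.2)
# Same values with one extreme observation appended, mimicking an anomaly.
with_outlier <- c(clean, 500)

cov_clean   <- sapply(1:3, function(k) coverage(clean, k))
cov_outlier <- sapply(1:3, function(k) coverage(with_outlier, k))

# Rows: data set; columns: k = 1, 2, 3. Compare with 0.68, 0.95, 0.997.
print(round(rbind(clean = cov_clean, with_outlier = cov_outlier), 4))
```

A single extreme value inflates the standard deviation so much that almost every other observation lies within one standard deviation of the mean, which is the pattern described in part (b).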

2. Management is concerned that current lending procedures produce loans that,

on average, have PRSM scores below the target 1. After correcting the anomaly

noted in Question 1, does a confidence interval indicate that management should

indeed be concerned that average PRSM scores are lower than desired (i.e.,

lower than 1)? Be sure to justify the use of a confidence interval in this context.

Ans)

Standard Error ----> @ Confidence Level 95%

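# Sample mean, standard deviation and size after removing the outliers
m <- mean(PRSM$Amt.Repaid.at.6.Months)
s <- sd(PRSM$Amt.Repaid.at.6.Months)
n <- length(PRSM$Amt.Repaid.at.6.Months)

# Margin of error; the start of this line was lost in the original,
# so the t quantile used here is an assumed reconstruction
error <- qt(0.975, df = n - 1) * s / sqrt(n)
error

# Lower and upper limits of the 95% confidence interval
left <- m - error
right <- m + error
left
right
m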

OUTPUT:

3. Control charts can be used to measure the stability of many types of data,

including the performance of loans. Assume that the loans in your data table are

arranged in chronological order, starting from the first row through row 628.

Generate a control chart of the PRSM scores for your sample, completing the

JMP dialog as shown below. These choices set the process mean μ = 0.9, the

standard deviation σ = 0.24, and group the loans into batches of size 40. Be sure

to resolve the anomaly noted in the first question.

(a) Do the resulting x-bar and s-charts indicate that the lending process has been “in

control” over the sampled period? What are the implications, especially with regard

to the confidence interval in Question 2?

(b) Why are the control limits for the mean in the x-bar chart so much wider than

the confidence interval for the mean used in Question 2? Two reasons, please!

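Ans) First, the control limits span 6σ (±3 standard errors of the batch mean), while the

95% confidence interval in Question 2 spans only 4σ (±2 standard errors). Second, the

control limits use the standard error of a batch of 40 loans, whereas the confidence

interval uses the standard error of the full sample, which is much smaller.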
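
The difference in width can be illustrated with a short calculation. This is a sketch under stated assumptions: μ = 0.9, σ = 0.24 and the batch size of 40 come from the question; three-sigma limits are assumed for the x-bar chart; and the values `s = 0.2067` and `n = 628` used for the confidence interval are borrowed from the summary statistics in Question 6 and the row count mentioned in Question 3, so they may differ from the cleaned sample actually used.

```r
mu    <- 0.9    # process mean from the question
sigma <- 0.24   # process standard deviation from the question
batch <- 40     # batch size from the question

# Three-sigma control limits for the batch mean (assumed chart convention).
xbar_half_width <- 3 * sigma / sqrt(batch)
control_limits  <- c(lower = mu - xbar_half_width, upper = mu + xbar_half_width)

# Assumed inputs for the confidence interval (see the note above).
s <- 0.2067
n <- 628
ci_half_width <- qnorm(0.975) * s / sqrt(n)

print(control_limits)
print(c(xbar_half_width = xbar_half_width, ci_half_width = ci_half_width))

# Ratio of the two half-widths: how many times wider the chart limits are.
print(xbar_half_width / ci_half_width)
```

Both effects push in the same direction: the multiplier is larger (3 rather than about 2) and the divisor is smaller (the square root of 40 rather than the square root of the full sample size).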

4. (a) The variable Years in Business may ultimately be useful in predicting the

PRSM score. Describe the distribution of this variable.

(b) Describe the distribution of the variable defined by log(1 + Years in

Business). In particular, is the variation in the transformed variable “nearly

normal”?

Ans 4. (a)

OUTPUT:

The plot shows that the data follow a negative exponential (right-skewed) distribution

rather than a normal distribution.

(b) Describe the distribution of the variable defined by log(1 + Years in Business). In

particular, is the variation in the transformed variable “nearly normal”?

Ans)

OUTPUT:
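
A minimal sketch of how the effect of the log(1 + Years in Business) transformation could be checked numerically. The data are synthetic (an exponential sample standing in for Years in Business), and the names `years`, `log_years` and `skewness()` are hypothetical; the report's own column would be substituted in practice.

```r
set.seed(42)
# Synthetic right-skewed stand-in for Years in Business (NOT the report's data).
years <- rexp(500, rate = 1 / 8)

# log(1 + x); log1p() is the numerically stable form and is defined at x = 0.
log_years <- log1p(years)

# Simple moment-based sample skewness; values near 0 suggest symmetry.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

print(c(raw = skewness(years), transformed = skewness(log_years)))
print(summary(log_years))
```

For the graphical view, `hist(log_years)` and `qqnorm(log_years)` followed by `qqline(log_years)` would show how close the transformed variable is to a normal shape.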

5. The distribution of Average House Value in Zip Code reveals an unusual

feature that appears to be an artificial consequence of the data processing. If

this artificial feature could be corrected so that the data showed the actual

average house values, then:

(a) How would the mean and standard deviation of this variable change?

Would x-bar and s remain unchanged, increase or decrease?

(b) Would the median and IQR of this variable remain unchanged, increase or

decrease?

Ans 5. (a) Summary of Average House Value in Zip Code in the dataset:

OUTPUT:

As the summary shows, the mean of Average House Value in Zip Code is anomalous: the

variable contains a value that is far away from the usual values, i.e., an outlier.

The value of s is mentioned in the screenshot above.

OUTLIERS:

BOXPLOT for OUTLIERS:

The CODE to remove outliers is listed below:

summary(df_2$Average.House.Value.in.Zip.Code)

summary(df_2_without_Outliers$Average.House.Value.in.Zip.Code)

OUTPUT:

(b) Would the median and IQR of this variable remain unchanged, increase or decrease?

CODE:

OUTPUT:
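
The comparison in parts (a) and (b) can be sketched as follows. This is an illustration on synthetic house values, not the report's data, and the boxplot (1.5 × IQR) rule used to drop outliers is an assumption, since the report does not show how `df_2_without_Outliers` was built. The names `house_value`, `trimmed` and `stats()` are hypothetical.

```r
set.seed(7)
# Synthetic house values in dollars (NOT the report's data), plus a few extreme values.
house_value <- c(rnorm(300, mean = 250000, sd = 40000), 2.0e6, 2.5e6, 3.0e6)

# Boxplot rule: keep values within 1.5 * IQR of the first and third quartiles.
q     <- quantile(house_value, c(0.25, 0.75))
fence <- 1.5 * IQR(house_value)
keep  <- house_value >= q[1] - fence & house_value <= q[2] + fence
trimmed <- house_value[keep]

# Mean and sd are sensitive to extreme values; median and IQR are resistant.
stats <- function(x) c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
print(round(rbind(with_outliers = stats(house_value), without_outliers = stats(trimmed))))
```

In this synthetic example the mean and standard deviation drop noticeably once the extreme values are removed, while the median and IQR barely move.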

6. An analyst extracted a sample of 25 loans from the same population as your

sample of loans. Estimate the probability that the average PRSM in a sample of

25 from this population is greater than 0.9. Identify any relevant assumptions

and state why you believe them to be plausible.

Mean 0.8094

Std Dev 0.2067

Kurtosis -0.0216

Ans 6. (A)

Assumption:

1. The given mean, standard deviation and kurtosis are those of the population.

2.

OUTPUT:

The probability that the mean of a sample of 25 is greater than 0.9 is

0.01420485.
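
The calculation behind this probability can be sketched as follows, assuming the sample mean is approximately normal with the quoted mean and standard deviation treated as population values (assumption 1 above). The variable names are introduced here for illustration.

```r
mu    <- 0.8094   # mean quoted in the question, treated as the population mean
sigma <- 0.2067   # standard deviation quoted in the question
n     <- 25       # size of the analyst's sample

se <- sigma / sqrt(n)                # standard error of the sample mean
z  <- (0.9 - mu) / se                # standardized distance of 0.9 from the mean
p  <- pnorm(z, lower.tail = FALSE)   # P(sample mean > 0.9)

print(c(se = se, z = z, p = p))
```

The near-zero kurtosis quoted in the question is consistent with a roughly normal shape, which supports using the normal approximation even with a sample of only 25.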

7. Would you be surprised if the sample of 25 obtained by the analyst described

in the prior question contains fewer than 5 loans that were originated from

ISO named “Credit Divas”? Explain your answer and justify any assumptions

you have made.

Ans) It would not be a big surprise: 5 Credit Divas loans in a random sample of 25 is

just 20%. Our sample contains 685 Credit Divas loans out of 1687, i.e., 41%, so there

is a good possibility of obtaining such a sample. Weighted sampling can be used to

avoid such biased scenarios.
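
One way to check this reasoning numerically is a binomial model, sketched below. It assumes the 25 loans are drawn independently and that each is a Credit Divas loan with probability 685/1687, the share reported above; the variable names are hypothetical.

```r
p_diva <- 685 / 1687   # share of Credit Divas loans reported in the answer
n      <- 25           # size of the analyst's sample

# Expected number of Credit Divas loans in a sample of 25.
expected <- n * p_diva

# "Fewer than 5" means 4 or fewer, so use the cumulative probability at 4.
p_fewer_than_5 <- pbinom(4, size = n, prob = p_diva)

print(c(expected = expected, p_fewer_than_5 = p_fewer_than_5))
```

A small value of `p_fewer_than_5` would mean that such a sample is unusual under this model; a large value would mean it is unremarkable.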

8. For this question, recode the PRSM score into a two-level categorical variable,

labeled as “Above” or “Below” depending on whether the PRSM score is above

or below the average PRSM score in your dataset.

It has been suggested that loans given to repeat customers (Loan Type

identifies this variable) perform worse than those given to first time borrowers.

To what extent is this assertion supported by the data? Provide a brief

discussion using ideas from the first two lectures to support your answer.

In the cross table, Type 1 refers to original customers and Type 2 refers to

repeat customers.

"Below" and "High" are categorized relative to the target PRSM score of 1.

The table shows that 85.5% of original customers do not meet the target,

whereas 75.5% of repeat customers do not meet the target PRSM. The assertion

that repeat customers perform worse is therefore not supported by this table.
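
A cross table of this kind could be built as sketched below. The data are synthetic, and the names `loans`, `Loan.Type`, `Level` and `cutoff` are hypothetical. The cutoff is set to the sample mean, as the question states; replacing it with 1 would reproduce the target-based split used in this answer.

```r
set.seed(3)
# Synthetic data (NOT the report's data): loan type and PRSM score.
loans <- data.frame(
  Loan.Type = sample(c("Original", "Repeat"), 400, replace = TRUE),
  PRSM      = rnorm(400, mean = 0.8, sd = 0.2)
)

# Recode PRSM into two levels relative to the cutoff.
cutoff <- mean(loans$PRSM)
loans$Level <- ifelse(loans$PRSM > cutoff, "Above", "Below")

# Counts by loan type and level, then row percentages (each row sums to 100).
counts  <- table(loans$Loan.Type, loans$Level)
row_pct <- prop.table(counts, margin = 1)

print(counts)
print(round(100 * row_pct, 1))
```

Comparing the "Below" percentages across the two rows is the comparison made in this answer; a formal check of whether the gap exceeds sampling noise could use `prop.test()` on the same counts.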

9. Does your answer to the previous question imply that Original/Repeat loan

status is the cause of an improved PRSM score or might there be another

explanation? If so, suggest a possible “lurking” variable that could influence

the comparison (the lurking variable can be a hypothetical one, it doesn’t have

to exist in the data set). Otherwise, explain briefly why it is not possible.

(Reading Section 5.2 of SF could be helpful here.)

From the correlation matrix, we can see that there is no significant correlation

between any of the variables and PRSM; hence, a new variable obtained by

transforming the existing variables is required.

10. If you wanted to construct an approximate 95% confidence interval for the

proportion of future loans that originate from ISO Loan Masters, and this CI was to

have a margin of error of +/-2%, then what sample size would you recommend?

Ans)
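
A sketch of the standard sample-size calculation for a proportion, n = z² p(1 − p) / E². The planning value p = 0.5 is an assumption introduced here: it is the conservative choice that maximizes p(1 − p) when the true share of ISO Loan Masters loans is not known in advance. If the observed share in the current sample were used instead, the required n would be smaller.

```r
z <- qnorm(0.975)   # two-sided 95% critical value (about 1.96)
E <- 0.02           # desired margin of error (+/- 2 percentage points)
p <- 0.5            # conservative planning value (assumption)

# Round up, since a fractional loan cannot be sampled.
n <- ceiling(z^2 * p * (1 - p) / E^2)
print(n)
```

Under these assumptions the calculation gives roughly 2,400 loans.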

