
BUSINESS STATISTICS

Project
Installment – I

Group – 8
Abhishek Singh (173)
Girish Matada (194)
Jai Kishore Jangir (198)
Paras Kohli (210)
Tushar Rane (233)
Unnati Kandelwal (234)
Installment 1
1. (a) Describe the distribution of the PRSM scores using both graphics and
descriptive statistics.
(b) Does it appear reasonable to use the Empirical Rule to describe the variation
of PRSM? Briefly explain your reasoning.
(c) If you were to remedy any evident anomalies with this variable, how well
would the Empirical Rule work now?

Ans
(a) The PRSM score is defined as PRSM = (2 * Amount repaid at 6 months) / Total amount to be repaid.
CODE: Descriptive Statistics
# initial_PRSM: presumably the original data, outliers included
mean(initial_PRSM$Amt.Repaid.at.6.Months)
median(initial_PRSM$Amt.Repaid.at.6.Months)
getmode(initial_PRSM$Amt.Repaid.at.6.Months)

# PRSM: presumably the same data after the outliers were removed
mean(PRSM$Amt.Repaid.at.6.Months)
median(PRSM$Amt.Repaid.at.6.Months)
getmode(PRSM$Amt.Repaid.at.6.Months)
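Note that getmode is not part of base R, so the calls above presumably rely on a helper defined elsewhere in the script. A minimal sketch of one common definition follows; it is an assumption for illustration, not necessarily the version the group used.

```r
# Hypothetical helper: returns the most frequent value of a vector.
# Ties are resolved in favour of the value that appears first.
getmode <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# Example call on made-up values
getmode(c(0.8, 1.0, 0.8, 0.9))
```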

OUTPUT:
GRAPHICAL Representation

With Outliers:

Without Outliers:
(b) Does it appear reasonable to use the Empirical Rule to describe the variation of
PRSM? Briefly explain your reasoning.
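Ans) No. As indicated in part (a), 99.84% of the values fall inside the standard deviation band around the mean, which is inconsistent with the 68%-95%-99.7% pattern the Empirical Rule predicts, so it is not advisable to use the Empirical Rule on the uncorrected data.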

CODE:
mean(initial_PRSM$Amt.Repaid.at.6.Months)
sd(initial_PRSM$Amt.Repaid.at.6.Months)

mean(PRSM$Amt.Repaid.at.6.Months)
sd(PRSM$Amt.Repaid.at.6.Months)
OUTPUT:

(c) If you were to remedy any evident anomalies with this variable, how well would
the Empirical Rule work now?
Ans) After removing the outliers, the Empirical Rule describes the variation in PRSM reasonably well.
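One way to check this claim is to compare the observed share of values within 1, 2 and 3 standard deviations of the mean against the Empirical Rule targets. The sketch below uses simulated data as a stand-in for the cleaned scores; the function name coverage and the simulated vector are assumptions for illustration only.

```r
# Share of observations within k standard deviations of the mean.
coverage <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Simulated stand-in for the cleaned PRSM scores (not the real data).
set.seed(1)
x <- rnorm(600, mean = 0.8, sd = 0.2)

# Compare with the Empirical Rule targets of about 0.68, 0.95 and 0.997.
sapply(1:3, function(k) coverage(x, k))
```

Applied to the real column, shares close to the three targets would support using the rule; large departures (as with the uncorrected data) would not.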
2. Management is concerned that current lending procedures produce loans that,
on average, have PRSM scores below the target 1. After correcting the anomaly
noted in Question 1, does a confidence interval indicate that management should
indeed be concerned that average PRSM scores are lower than desired (i.e.,
lower than 1)? Be sure to justify the use of a confidence interval in this context.
Ans)
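Margin of error at the 95% confidence level:
# sample mean, standard deviation and size of the outlier-free data
m <- mean(PRSM$Amt.Repaid.at.6.Months)
s <- sd(PRSM$Amt.Repaid.at.6.Months)
n <- length(PRSM$Amt.Repaid.at.6.Months)

# t-based margin of error with n - 1 degrees of freedom
error <- qt(0.975, df = n - 1) * s / sqrt(n)
error
left <- m - error
right <- m + error

left
right
m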

OUTPUT:
3. Control charts can be used to measure the stability of many types of data,
including the performance of loans. Assume that the loans in your data table are
arranged in chronological order, starting from the first row through row 628.
Generate a control chart of the PRSM scores for your sample, completing the
JMP dialog as shown below. These choices set the process mean μ = 0.9, the
standard deviation σ = 0.24, and group the loans into batches of size 40. Be sure
to resolve the anomaly noted in the first question.
(a) Do the resulting x-bar and s-charts indicate that the lending process has been “in
control” over the sampled period? What are the implications, especially with regard
to the confidence interval in Question 2?

(b) Why are the control limits for the mean in the x-bar chart so much wider than
the confidence interval for the mean used in Question 2? Two reasons, please!

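Ans) Two reasons. First, the control limits span ±3 standard errors around the mean (a total width of 6 standard errors), while the 95% confidence interval in Question 2 spans about ±2 standard errors (a total width of 4). Second, each point on the x-bar chart averages a batch of only 40 loans, so its standard error σ/√40 is much larger than the standard error of the mean of the full sample used in Question 2.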
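The difference in width can be illustrated numerically. The sketch below uses μ = 0.9, σ = 0.24 and a batch size of 40 from the question; the full sample size of 628 (before any anomaly is removed) and the use of the same σ and a normal quantile for the confidence interval are simplifying assumptions.

```r
sigma   <- 0.24   # process standard deviation from the question
mu      <- 0.9    # process mean from the question
n_batch <- 40     # loans per batch on the x-bar chart
n_all   <- 628    # assumed sample size for the Question 2 interval

# x-bar chart limits: mu +/- 3 standard errors of a batch mean
se_batch <- sigma / sqrt(n_batch)
control_limits <- mu + c(-3, 3) * se_batch

# 95% interval half-width for the mean of the whole sample
ci_half_width <- qnorm(0.975) * sigma / sqrt(n_all)

diff(control_limits)   # total width of the control limits
2 * ci_half_width      # total width of the confidence interval
```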
4. (a) The variable Years in Business may ultimately be useful in predicting the
PRSM score. Describe the distribution of this variable.
(b) Describe the distribution of the variable defined by log(1 + Years in
Business). In particular, is the variation in the transformed variable “nearly
normal”?
Ans 4. (a)

OUTPUT:

The plot shows that Years in Business is strongly right-skewed: it resembles a negative exponential distribution rather than a normal distribution.
(b) Describe the distribution of the variable defined by log(1 + Years in Business). In
particular, is the variation in the transformed variable “nearly normal”?

Ans)

OUTPUT:

The transformed variable is approximately normally distributed.
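A sketch of how the transformation and the normality check might be carried out is given below. The data are simulated as a right-skewed stand-in for Years in Business, and the skewness helper is a hypothetical convenience function; neither comes from the project's data set.

```r
# Simulated right-skewed stand-in for Years in Business (not the real data).
set.seed(42)
years <- rexp(1000, rate = 0.2)

# log1p(x) computes log(1 + x) and is safe for zero values
log_years <- log1p(years)

# Sample skewness: values near 0 indicate a roughly symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(years)
skewness(log_years)

# Normal quantile plot of the transformed values; points near the line
# suggest the variable is nearly normal
qqnorm(log_years)
qqline(log_years)
```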


5. The distribution of Average House Value in Zip Code reveals an unusual
feature that appears to be an artificial consequence of the data processing. If
this artificial feature could be corrected so that the data showed the actual
average house values, then:
(a) How would the mean and standard deviation of this variable change?
Would x-bar and s remain unchanged, increase or decrease?
(b) Would the median and IQR of this variable remain unchanged, increase or
decrease?

Ans 5. (a) Summary of Average House Value in Zip Code in the dataset:

OUTPUT:

As the summary shows, the mean is anomalous: Average House Value in Zip Code contains a value that lies far from the usual values, that is, an outlier.
The value of s is shown in the output above.
OUTLIERS:

To correct this anomaly, the outliers need to be removed.


BOXPLOT for OUTLIERS:
CODE to summarise the variable before and after removing the outliers:

summary(df_2$Average.House.Value.in.Zip.Code)
summary(df_2_without_Outliers$Average.House.Value.in.Zip.Code)
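The step that builds df_2_without_Outliers is not shown. One common approach, sketched below, drops values outside the boxplot fences (1.5 × IQR beyond the quartiles). This is an assumption about the method, and the toy values in df_2 are made up for illustration.

```r
# Toy stand-in for df_2 (values are made up for illustration).
df_2 <- data.frame(
  Average.House.Value.in.Zip.Code = c(120, 150, 160, 175, 180, 210, 240, 1000)
)

# Flag values outside the boxplot fences (1.5 * IQR beyond the quartiles).
x <- df_2$Average.House.Value.in.Zip.Code
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
keep <- x >= q[1] - fence & x <= q[2] + fence

df_2_without_Outliers <- df_2[keep, , drop = FALSE]

summary(df_2$Average.House.Value.in.Zip.Code)
summary(df_2_without_Outliers$Average.House.Value.in.Zip.Code)
```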

OUTPUT:
(b) Would the median and IQR of this variable remain unchanged, increase or decrease?

CODE:

OUTPUT:
6. An analyst extracted a sample of 25 loans from the same population as your
sample of loans. Estimate the probability that the average PRSM in a sample of
25 from this population is greater than 0.9. Identify any relevant assumptions
and state why you believe them to be plausible.
Mean 0.8094
Std Dev 0.2067
Kurtosis -0.0216

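Ans 6.
Assumptions:
1. The given mean, standard deviation and kurtosis describe the population.
2. The 25 loans are a random sample, and the sample mean is approximately normally distributed (the kurtosis is close to 0, so a sample of 25 is large enough for the normal model).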

OUTPUT:

The probability that the mean of a sample of 25 is greater than 0.9 is 0.01420485 (about 1.4%).
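Since the output is not reproduced here, the following sketch shows one way to obtain this probability, treating the given mean and standard deviation as population values and using the normal model for the sample mean (both are the assumptions stated in the answer).

```r
mu    <- 0.8094   # mean, treated as the population mean
sigma <- 0.2067   # standard deviation, treated as the population value
n     <- 25       # size of the analyst's sample

se <- sigma / sqrt(n)                      # standard error of the sample mean
p  <- 1 - pnorm(0.9, mean = mu, sd = se)   # P(sample mean > 0.9)
p
```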
7. Would you be surprised if the sample of 25 obtained by the analyst described
in the prior question contains fewer than 5 loans that were originated from
ISO named “Credit Divas”? Explain your answer and justify any assumptions
you have made.

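Ans) Yes, it would be surprising. In our data, 685 of 1687 loans (about 41%) originate from Credit Divas, so a random sample of 25 should contain about 10 such loans on average. Fewer than 5 means a share below 20%, less than half the expected share, which is unlikely if the 25 loans are a random sample and each loan independently has the same chance of coming from Credit Divas (a binomial model). Stratified or weighted sampling can be used to avoid such unrepresentative samples.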
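The likelihood of such a sample can be quantified with a binomial model. The sketch below assumes the 25 loans are drawn independently and that the population share of Credit Divas loans equals the 685/1687 observed in the data.

```r
p_divas <- 685 / 1687   # share of Credit Divas loans reported in the answer
n <- 25                 # size of the analyst's sample

expected <- n * p_divas                            # expected number of Credit Divas loans
prob_fewer_than_5 <- pbinom(4, size = n, prob = p_divas)   # P(X <= 4)

expected
prob_fewer_than_5
```

A small probability here means a sample with fewer than 5 Credit Divas loans would be unusual under these assumptions.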

8. For this question, recode the PRSM score into a two-level categorical variable,
labeled as “Above” or “Below” depending on whether the PRSM score is above
or below the average PRSM score in your dataset.
It has been suggested that loans given to repeat customers (Loan Type
identifies this variable) perform worse than those given to first time borrowers.
To what extent is this assertion supported by the data? Provide a brief
discussion using ideas from the first two lectures to support your answer.

• In the cross table, Type 1 refers to original customers and Type 2 refers to repeat customers.
• "Below" and "High" are categorized relative to the target PRSM score of 1.
• The table shows that 85.5% of original customers do not meet the target, compared with 75.5% of repeat customers. The assertion that repeat customers perform worse is therefore not supported by this table.
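A cross table of this kind can be produced as sketched below. The data frame loans, its column names and its values are made up for illustration, and the recoding here uses the sample average as the cut-off, as the question asks.

```r
# Toy data (made up): PRSM scores and loan type for ten loans.
loans <- data.frame(
  PRSM      = c(0.6, 0.7, 0.9, 1.1, 0.8, 1.2, 0.5, 1.0, 0.95, 0.75),
  Loan.Type = c("Original", "Original", "Original", "Original", "Original",
                "Repeat", "Repeat", "Repeat", "Repeat", "Repeat")
)

# Recode PRSM into a two-level variable relative to the sample average.
loans$Level <- ifelse(loans$PRSM > mean(loans$PRSM), "Above", "Below")

# Row percentages: share of Above/Below within each loan type.
tab <- table(loans$Loan.Type, loans$Level)
row_shares <- prop.table(tab, margin = 1)
row_shares
```

Comparing the "Below" share across the two rows is what supports or contradicts the claim that repeat customers perform worse.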
9. Does your answer to the previous question imply that Original/Repeat loan
status is the cause of an improved PRSM score or might there be another
explanation? If so, suggest a possible “lurking” variable that could influence
the comparison (the lurking variable can be a hypothetical one, it doesn’t have
to exist in the data set). Otherwise, explain briefly why it is not possible.
(Reading Section 5.2 of SF could be helpful here.)

From the correlation matrix, none of the variables shows a significant correlation with PRSM, so a new variable obtained by transforming the existing variables is required.

10. If you wanted to construct an approximate 95% confidence interval for the
proportion of future loans that originate from ISO Loan Masters, and this CI was to
have a margin of error of +/-2%, then what sample size would you recommend?
Ans)
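One standard way to size such a sample uses n = z² p(1 − p) / E², where E is the margin of error. The sketch below assumes the worst-case proportion p = 0.5, since the share of ISO Loan Masters loans is not stated here; using the observed share from the data instead would give a smaller n.

```r
z      <- qnorm(0.975)   # about 1.96 for 95% confidence
margin <- 0.02           # desired margin of error
p      <- 0.5            # worst-case proportion (assumption)

# Round up so the margin of error is no larger than requested
n_required <- ceiling(z^2 * p * (1 - p) / margin^2)
n_required
```

Under these assumptions the calculation gives 2401 loans.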