Data Analysis

Published by verai1131

Draft of a document on data analysis.

https://www.scribd.com/doc/129234809/Data-Analysis

08/30/2013

I. Introduction

FICO scores, which range from 300 to 850, indicate the credit risk of a credit applicant and are used by lenders as an important determinant of the interest rate that a borrower will ultimately have to pay. FICO scores are based on the following five categories of credit information: payment history (35 per cent)¹; amounts owed (30 per cent); length of credit history (15 per cent); new credit (10 per cent); and types of credit in use (10 per cent).

However, FICO scores are not the only factor determining the level of interest rates. This document attempts to quantify the relative importance of three other variables thought to be significantly associated with the interest rate of a loan: the amount requested, the length of the loan and the monthly income of the applicant. While income is normally constant over the short term, the length of the loan and the amount requested are thought to be controlled by the applicant. Estimating the individual impact of these variables on the interest rate may therefore be of interest to prospective borrowers.

Section II of this document reviews the methods used, including a brief description of the data set, a summary of the exploratory data analysis conducted and an overview of the statistical modeling undertaken. Section III discusses the results of an estimated multiple linear regression model and presents some caveats that should be considered when making inferences on the basis of the estimated model. Finally, Section IV summarizes the conclusions that can be drawn from the analysis.

¹ Fair Isaac Corporation (2011). Figures in parentheses indicate approximately how much of a FICO score is based on the corresponding category.

II. Methods

Data set and variables used

The data set used in this document contains information on 2,500 peer-to-peer loans issued by the Lending Club (www.lendingclub.com/home.action). The data set was contained in a csv file and was downloaded from the following URL: https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv. All data analyses in this document were conducted in R, ver. 2.15.1. This data set includes both quantitative and categorical variables. These variables are listed below.

Quantitative variables:
i. The amount requested (in USD).
ii. The amount loaned to the applicant (in USD).
iii. The interest rate of the loan (in per cent).
iv. The length of time of the loan (in months).
v. The debt-to-income ratio of applicant (in per cent).
vi. The monthly income of applicant (in USD).
vii. The FICO range of the applicant (ranging from 640 to 834 in the sample).
viii. The current number of lines of credit opened by the applicant.
ix. The total amount of outstanding credit of the applicant (in USD).
x. The number of credit inquiries during the six months prior to the application; and
xi. The length of time in the current employment (in years).

Categorical variables:
i. The purpose of the loan (car, credit card, debt consolidation, education, home improvement, house, major purchase, medical, moving, etc.).
ii. The state of residence of applicant; and
iii. The ownership status of applicant (e.g. home owner, renter, or paying mortgage).

The data set also includes loan numbers, presumably unique identifiers of loans.

Exploratory data analysis

Exploratory data analysis was used to understand important characteristics (e.g. center, variation and extreme observations) and the nature of the univariate and multivariate distributions of all variables in the data set.
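
As an illustration of the kind of summary involved, the sketch below computes measures of center and variation for one variable and flags extreme observations using quartile-based fences (1.5 times the interquartile range beyond the quartiles, a common convention). It is written in Python rather than the R used for the analysis, the function name `summarize` is hypothetical, and the income values are made up for the example; none of this is taken from the original analysis code.

```python
import statistics

def summarize(values):
    """Return center, spread and quartile-based fences for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }

# Hypothetical monthly incomes in USD (illustrative, not from the data set).
incomes = [2500, 3200, 4100, 4800, 5000, 5600, 6200, 7500, 9000, 40000]

summary = summarize(incomes)
# Observations outside the fences are candidates for closer inspection.
extreme = [x for x in incomes
           if x < summary["lower_fence"] or x > summary["upper_fence"]]

print(summary)
print(extreme)
```

With these made-up values only the largest income falls outside the fences. Whether a flagged observation is dropped, transformed or kept remains a judgment call for the analyst.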

In particular, scatter plots, boxplots, histograms and summary tables were used. Panel (a) in the accompanying figure is a scatter plot showing a negative relationship between the average of the FICO range and the interest rate of loans. It also suggests that both longer loans (highlighted in purple in the figure) and larger amounts requested (proportional to the size of the dots) tend to cluster towards the upper part of the figure, indicating that larger and lengthier loans tend to be associated with greater interest rates.

Panel (b) presents a histogram of the income of applicants, a variable expected to be strongly associated with the interest rate of a loan. As can be observed in panel (b), the distribution of income is right-skewed, with three extreme outliers, labeled 41411, 18439 and 54487 (incomes of USD 39,583.33, 65,000 and 102,750, respectively). These three observations were excluded from all subsequent analyses. Furthermore, a boxplot of the distribution of income (not shown) indicated that there are quite a few additional mild outliers, that is, observations greater than the 3rd quartile plus 1.5 times the interquartile range. However, these observations were not as extreme as the three mentioned above and were handled by a logarithmic transformation. Finally, loan number 101596 has incomplete information on some variables, notably monthly income, and was also excluded.

Statistical modeling

A preliminary linear regression modeling the interest rate as a function of all numerical variables in the data set indicated that three variables (the debt-to-income ratio, the amount of outstanding credit and the length of time in the current employment) were not statistically significant at the 5-per cent level (p > 0.05). Consequently, they were excluded from subsequent analyses. Furthermore, two other variables (the number of credit inquiries and the current number of lines of credit opened by the applicant) are known to be among the elements composing FICO scores and, therefore, are not considered in what follows, so as to minimize the risk of overloading the regression model and the effects of multicollinearity.

III. Results

On the basis of 2,496 complete observations, the finally estimated model is:

IR = β1 + β2 FICO + β3 LL + β4 AR + β5 log(I) + ε,

where IR is the interest rate, FICO is the average of FICO scores, LL is the length of the loan in months, AR is the amount requested in USD and I is the monthly income of the applicant in USD.
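
A model of this form can be estimated by ordinary least squares. The sketch below is a minimal Python illustration, not the R code used for the analysis: it builds the design columns 1, FICO, LL, AR and log(I), and solves the normal equations with a small hand-written solver. The helper names (`solve`, `fit_ols`, `design_row`), the synthetic loans and the "true" coefficients used to generate them are all assumptions made up for the example.

```python
import math
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Return least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def design_row(fico, loan_length, amount, income):
    """Columns in the order of the model: 1, FICO, LL, AR, log(I) (natural log)."""
    return [1.0, fico, loan_length, amount, math.log(income)]

# Synthetic loans generated from made-up coefficients (not the document's data).
random.seed(0)
true_beta = [0.70, -9e-4, 1.4e-3, 1.5e-6, -4e-3]
rows, rates = [], []
for _ in range(500):
    row = design_row(
        fico=random.uniform(640, 830),
        loan_length=random.choice([36, 60]),
        amount=random.uniform(1000, 35000),
        income=random.uniform(1500, 15000),
    )
    noise = random.gauss(0, 0.02)
    rows.append(row)
    rates.append(sum(b * x for b, x in zip(true_beta, row)) + noise)

estimates = fit_ols(rows, rates)
for name, value in zip(["b1", "b2", "b3", "b4", "b5"], estimates):
    print(name, f"{value:.3e}")
```

The normal equations keep the example short but are numerically fragile when predictors differ widely in scale, as they do here. Statistical software such as R's `lm` uses a QR decomposition instead, and that is the better choice for real data.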

The model includes an error term, ε, assumed to be a random variable, normally and independently distributed, with zero mean and constant variance². Estimated coefficients (confidence intervals, CI, in parentheses) are:

β1 = 7.078e-01 (CI: 6.9e-01, 7.3e-01),
β2 = -8.733e-04 (CI: -9.0e-04, -8.5e-04),
β3 = 1.357e-03 (CI: 1.3e-03, 1.4e-03),
β4 = 1.516e-06 (CI: 1.4e-06, 1.6e-06), and
β5 = -3.970e-03 (CI: -5.8e-03, -2.1e-03).

The independent term, β1, represents the level of the interest rate when all variables are equal to zero. The coefficient of determination adjusted by the degrees of freedom is 0.75, indicating that approximately 75 per cent of the variance of the interest rate is explained by the fitted model.

Note that although all individual coefficients are highly significant (p<0.001) and of the expected sign, they are all likely to have only a marginal impact on the estimated interest rate of the relevant loan. In particular, after conditioning on the purpose of the loan, assuming that two applicants had identical FICO scores and keeping all variables but one constant, it should be expected that, for instance:

i. A 10 per cent positive difference in monthly income favoring one of the applicants would produce a marginal reduction (of approximately 0.0004 per cent, or 0.00397*log(1.1))³ in the interest rate that this applicant would obtain;
ii. An increase in the length of the loan of one of the applicants from 36 to 60 months would likely result in an increase in the interest rate of only 0.0014 per cent; and
iii. An increase of USD 100 in the amount requested by one of the applicants would increase the interest rate received by this applicant by barely 0.0002 per cent.

Some caveats. Panel (c) shows the relationship between the estimated interest rate and the corresponding residuals. It should be noted that the fitted model seems to be overestimating the interest rate, as can be observed from the many positive residuals in this panel. More importantly, panel (c) suggests that there seems to be a non-linear (presumably quadratic) pattern in the residuals not accounted for by the variables included in the model. Furthermore, the pattern followed by the residuals in this same panel also suggests that the assumption of equal variance does not hold.

² Dickey et al. (1989).
³ Note that, due to the logarithmic transformation, the effect of income on the interest rate is not linear in this model.
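
The illustrative effects quoted in Section III can be recomputed from the reported point estimates. The sketch below is in Python, with hypothetical function names. It assumes that `log` is the natural logarithm (R's default, and the reading consistent with 0.00397*log(1.1) ≈ 0.0004) and that LL enters the model in months as defined. Results are in whatever units IR has in the fitted model; the intercept of about 0.71 suggests a fraction rather than a percentage, but that is an assumption.

```python
import math

# Point estimates as reported in Section III.
BETA_LL = 1.357e-03      # length of the loan, per month
BETA_AR = 1.516e-06      # amount requested, per USD
BETA_LOG_I = -3.970e-03  # natural log of monthly income

def income_effect(ratio):
    """Change in IR when income is multiplied by `ratio`, other variables fixed."""
    return BETA_LOG_I * math.log(ratio)

def length_effect(months):
    """Change in IR for a loan that is `months` longer, other variables fixed."""
    return BETA_LL * months

def amount_effect(usd):
    """Change in IR when the amount requested rises by `usd`, other variables fixed."""
    return BETA_AR * usd

print(f"{income_effect(1.10):.6f}")     # 10 per cent higher income: -0.000378
print(f"{length_effect(1):.6f}")        # one additional month:       0.001357
print(f"{length_effect(60 - 36):.6f}")  # 36 to 60 months:            0.032568
print(f"{amount_effect(100):.6f}")      # USD 100 more requested:     0.000152
```

Under these assumptions the income and amount figures round to the 0.0004 and 0.0002 quoted in the text. The 0.0014 figure matches β3 for a single additional month, whereas the full 24-month difference between a 36-month and a 60-month loan corresponds to 24 × β3.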

IV. Conclusions

Estimated results presented in Section III suggest that, after controlling for the effects of FICO scores and other things being equal, larger and lengthier loans are likely to have almost no impact on the interest rate of loans. However, this finding should be interpreted with extreme caution, as the estimated residuals seem to violate some of the assumptions of the linear regression model. Finally, although all individual coefficients are highly significant (p<0.001) and of the expected sign, their practical significance seems nil.

References

Fair Isaac Corporation (2011). Understanding your FICO Score. www.myfico.com/downloads/files/myfico_uyfs_booklet.pdf (accessed on February 18, 2013).

Dickey, David A., John O. Rawlings and Sastry G. Pantula (1989). Applied Regression Analysis, 2nd Ed. Springer-Verlag: NY.
