
Application of Statistical Tools in

Empirical Research

Dr. Arindam Bandyopadhyay

National Institute of Bank Management

What is Quantitative Research?


• Quantitative research is about measurements.
• Statistics and econometrics are the most widely used branches of mathematics in quantitative research.
• Quantitative research using statistical methods typically begins with the collection of data based on a theory or hypothesis, followed by the application of descriptive or inferential statistical methods.
• Those who are likely to be successful researchers/analysts are usually attracted more by the problem-solving side of the work and the practical application of the mathematics and logic than by the mathematical/statistical concepts per se.

The Empirical Research Process

1. Interest → the topic or theme of research
2. Reading earlier research and theoretical literature
3. Specification of the research problem: research questions, framing hypotheses, conceptualization
4. Planning the research process; empirical research design
5. Selection of variables and empirical tools; sources of data
6. Data collection
7. Data filtering, coding, sorting
8. Data analysis (quantitative analysis using statistical packages)
   1. Univariate/multivariate descriptive analysis
   2. Multivariate regression analysis
   3. Diagnostic checks

9. Interpretation of results
1. Answering empirical questions
2. Explanation of results
10. Drawing conclusions and policy actions
11. Bibliographic citations
12. Finalizing the report (sequencing the charts, tables, footnotes,
abstract and text etc.)
13. Publication of results

Data
• When considering the establishment of a framework for statistical testing, it is sensible to ensure the availability of a large enough set of reliable information on which to base the test. For example, if the analyst intends to find a 'one-in-five-year' event, the best way is to have a five-year database.

Problem Solving Approach


• Data analysis: summary statistics
• Central tendency/expectations
• Dispersion/volatility
• Understanding distribution fitting
• In-depth analysis of data
• Covariance and correlation
• Basic concepts in probability; joint probability
• Discrete & continuous distributions
• Hypothesis testing
• Modeling & forecasting
• Simulation & Value at Risk (VaR) techniques
• Simple linear regression
• Multivariate regression: MDA, multiple regression, logistic regression, etc.
• Time series analysis
• Diagnostic checks or validation
Descriptive Analysis
• In descriptive analysis we are interested in describing a single issue or social phenomenon (e.g. its frequency, distribution or magnitude).
• We can simply describe (e.g. the unemployment rate, average wages, support for different parties) or describe and compare (default rates in different regions, average wages in different professions of customers, party alignment within different social strata).
• Univariate analysis deals with single variables.

Descriptive Methods

• Frequency distribution (grouped): histogram, frequency curve, cumulative distribution, etc.
• Measures of central tendency (mean, median, mode, percentiles, etc.)
• Dispersion: SD, mean deviation, CV, range, moments, skewness, kurtosis, etc.
• Forms of distribution: discrete vs. continuous
Grouped Frequency Distribution

• Grouped frequency distribution is a tabular summary of data showing the frequency of items in each of several non-overlapping classes.
• Class interval width = (largest value − smallest value) / number of classes
• Relative frequency = frequency of the class / n
• Percentage frequency = relative frequency × 100
• Cumulative frequency: less-than type or more-than type
• Graphic presentation (a small worked sketch follows below):
  • Histogram
  • Frequency polygon
  • Ogive (cumulative)
  • Lorenz curve
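As an illustration (not from the original slides), a minimal Python sketch that builds a grouped frequency table from raw data; the exposure values are hypothetical:

    import numpy as np

    # hypothetical sample of 20 loan exposures (Rs. crore)
    data = np.array([0.5, 1.2, 3.4, 2.2, 7.8, 0.9, 4.1, 5.5, 2.7, 1.8,
                     6.3, 0.3, 9.9, 2.1, 3.3, 4.8, 1.1, 8.2, 2.9, 5.0])

    k = 5                                      # chosen number of classes
    width = (data.max() - data.min()) / k      # class interval width
    edges = data.min() + width * np.arange(k + 1)
    freq, _ = np.histogram(data, bins=edges)   # frequency of each class

    rel_freq = freq / data.size                # relative frequency
    pct_freq = rel_freq * 100                  # percentage frequency
    cum_freq = np.cumsum(freq)                 # less-than type cumulative frequency

    for i in range(k):
        print(f"{edges[i]:5.2f}-{edges[i+1]:5.2f}  f={freq[i]:2d}  "
              f"rel={rel_freq[i]:.2f}  pct={pct_freq[i]:5.1f}%  cum={cum_freq[i]}")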

Frequency Distribution
• Frequency distribution tabulates and presents all the occurring values arranged in order of magnitude, together with their respective frequencies.
• An inspection of the frequency distribution gives a quick idea of the average in the series and shows how the observations vary around the average (through a histogram or frequency polygon drawn from the frequency distribution).
• Frequency = simple count of the cases with a certain variable value
• Percent = percentage of the cases with a certain variable value
• Cumulative percent = percentage of the cases with the given or a smaller value
Descriptive Stats about Zone-wise Loan Distribution of a Bank
zone_group p1 p5 p10 p25 p50 p75 p90 p95 p99 min max range mean sd cv Kurtosis Gini HHI
Central_Z_I 0.11 0.4 0.69 1 1.68 3 7.67 11.15 84.79 0.01 90.89 90.9 4.042 10.40 2.57 56.47 0.634 0.356
Central_Z_II 0.03 0.39 0.62 0.99 1.43 2.54 6.97 13.71 107.7 0.01 211.43 211.4 5.410 20.75 3.83 76.68 0.724 0.519
East_Z 0.02 0.51 0.95 1.35 2.39 10.8 30.5 55.84 260 0.01 1251 1251.0 20.703 97.27 4.70 137.38 0.792 0.598
Mumbai_Z 0.04 0.29 0.64 1.48 4.14 15 49.6 133.7 560 0.004 1204.4 1204.4 27.815 91.71 3.30 67.32 0.786 0.572
North_Z 0.11 0.5 0.82 1.2 2.22 5.49 13.4 41.45 183.3 0.01 731.03 731.0 10.620 44.61 4.20 159.31 0.739 0.519
South_Z_I 0.13 0.83 0.97 1.31 2.4 6.41 24.3 38.27 97.3 0.02 380.5 380.5 8.735 23.87 2.73 155.87 0.701 0.421
South_Z_II 0.07 0.41 0.79 1.37 3.11 9.74 29 59 272.4 0.04 400 400.0 13.249 37.77 2.85 67.84 0.720 0.442
West_Z_I 0.21 0.73 0.94 1.53 3.27 11 29.7 105.5 225.1 0.12 385.4 385.3 18.296 48.88 2.67 31.59 0.759 0.547
West_Z_II 0.22 0.69 0.83 1.23 2.34 4.93 13.7 27.16 50.32 0.07 99.54 99.5 6.108 11.55 1.89 32.66 0.619 0.299
Total 0.08 0.46 0.79 1.24 2.51 7.94 26.1 52.41 250 0.004 1251 1251.0 15.505 62.26 4.02 163.61 0.771 0.578


Quartiles and Percentiles


• Quartiles divide an ordered list into quarters.
• For example, the first quartile (Q1) is a number greater than (or equal to) the values of 25% of the cases and lower than (or equal to) the values of the remaining 75%.
• In financial risk management the quantile chosen would be 90%, 95% or 99% in most cases, since the largest losses are observed at extreme quantiles. E.g. op-risk capital from a loss distribution approach (LDA) can be quantified by determining the 100p% quantile of the simulated distribution.
• Percentiles divide ordered lists into hundredths.
• One percent of the cases lie below the first percentile (p1) and 99% lie above it. For example, the 1st quartile (Q1) equals the 25th percentile (p25).
• For example, all the cases of a real sample of employees (N=1112) can be ordered according to monthly income (in Rs.); the median is then the value of the 556th and 557th cases (or their average).
Measures of dispersion
• Range = maximum value − minimum value
• Interquartile range (IQR) = Q3 − Q1
• Standard deviation (SD), variance (SD²)
• Coefficient of variation (CV) = SD/Mean
• Skewness (Sk) = 3(Mean − Median)/SD
  = (Mean − Mode)/SD
  or = [(Q3 − Q2) − (Q2 − Q1)] / [(Q3 − Q2) + (Q2 − Q1)]
  or (3rd-moment measure) = √β1 = µ3/σ³
• Kurtosis (4th-moment measure): β2 = µ4/µ2² = µ4/σ⁴; excess kurtosis = β2 − 3
• If β2 < 3, the distribution is platykurtic (thin tails, less peakedness); if β2 > 3, the distribution is leptokurtic (fat tails, high peakedness). When β2 = 3, the distribution is mesokurtic, as for the normal distribution.
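A minimal Python sketch (illustrative, not from the slides) computing these moment-based measures for a hypothetical loss sample:

    import numpy as np

    losses = np.array([1.2, 0.8, 2.5, 1.1, 6.9, 0.4, 1.8, 3.2, 0.9, 2.2])  # hypothetical

    mean, sd = losses.mean(), losses.std()          # population SD (divides by n)
    m3 = np.mean((losses - mean) ** 3)              # 3rd central moment
    m4 = np.mean((losses - mean) ** 4)              # 4th central moment

    cv = sd / mean                                  # coefficient of variation
    skew = m3 / sd ** 3                             # sqrt(beta1)
    beta2 = m4 / sd ** 4                            # kurtosis
    excess = beta2 - 3                              # excess kurtosis

    print(f"CV={cv:.3f} skew={skew:.3f} kurtosis={beta2:.3f} excess={excess:.3f}")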

Descriptive Statistics: Mean, Variance, Skew, Kurtosis, Gini, HHI

• These are the four moments about the mean that describe the nature of the loss distribution in risk measurement.
• The mean is the location of a distribution, and the variance, the square of the standard deviation, measures the scale of a distribution.
• The skew is a measure of the asymmetry of the distribution. In risk measurement, it tells us whether the probability of winning is similar to the probability of losing, and the nature of losses.
• Negative skewness means there is a substantial probability of a big negative return. Positive skewness means there is a greater-than-normal probability of a big positive return.
• Kurtosis is useful in describing extreme events (e.g., losses so bad that they have only a 1 in 1000 chance of happening).
• In extreme events, the portfolio with the higher kurtosis would suffer worse losses than the portfolio with lower kurtosis.
• Skewness and kurtosis are called the shape parameters.
Moments and the Nature of Distribution

• For a normal distribution, skewness = 0.
• The kurtosis of the normal distribution is 3.
• The normal distribution is so commonly used (especially in credit risk) that some researchers define the "excess kurtosis" as the calculated kurtosis minus 3.
• Distributions with a kurtosis greater than that of the normal distribution are said to be leptokurtic.

Kurtosis
• Since kurtosis measures the shape of the distribution (the fatness of the tails), it focuses on how losses are spread around the mean.
• Leptokurtic means a smaller proportion of medium-sized deviations from the mean, but a larger proportion of extremely large and small deviations from the mean. Kurtosis greater than three indicates a sharp/high peak with a thin midrange and fat tails (super-Gaussian type, e.g. Pareto distribution, lognormal distribution, Weibull distribution, etc.).
• Platykurtic means a smaller-than-normal proportion of deviations from the mean that are extremely small or large, and a larger proportion of medium-sized deviations from the mean (this may happen in stock return distributions). Kurtosis of less than three indicates a low peak with a fat midrange on either side (short tails, sub-Gaussian type, e.g. Bernoulli distribution).
• A normal distribution is called mesokurtic and has a kurtosis of 3 (its tails are thin relative to those of a leptokurtic distribution).
Difference between Skewness & Kurtosis
• Skewness measures the degree and direction of symmetry or asymmetry of the distribution.
• A normal or symmetrical distribution has a skewness of zero (0). But in operational loss results, normal distributions are hard to come by.
• Therefore, a distribution may be positively skewed (skewed to the right, longer tail to the right, represented by a positive value; typical of loss series) or negatively skewed (skewed to the left, longer tail to the left, with a negative value; typical of return series).
• Kurtosis measures how peaked a distribution is, and the lightness or heaviness of the tails of the distribution. In other words, how much of the distribution is actually located in the tails?
• A positive excess kurtosis value means that the tails are heavier than a normal distribution and the distribution is said to be leptokurtic (with a higher, more acute "peak"). A negative excess kurtosis value means that the tails are lighter than a normal distribution and the distribution is said to be platykurtic (with a smaller, flatter "peak").

Measures of Moments

• A distribution is fully described by its first 4 moments:
  m1 = ΣX/n is the mean
  m2 = Σ(X − X̄)²/n is the variance
  m3 = Σ(X − X̄)³/n is the absolute measure of skewness
  m4 = Σ(X − X̄)⁴/n is the absolute measure of kurtosis
• Relative measure of skewness: Sk = m3/SD³, which typically ranges between −3 and +3 (Sk = 0 indicates a symmetric distribution)
• Relative measure of kurtosis: kr = m4/SD⁴; kr = 3 indicates a mesokurtic distribution
Herfindahl-Hirschman Index (HHI)
• The Herfindahl index is a commonly used ratio to measure concentration/inequality of a distribution.
• The Herfindahl index measures concentration as the sum of the squared business shares of each loan in the pool (or portfolio), i.e.,

  HHI = Σₙ₌₁ᴺ Eₙ² / (Σₙ₌₁ᴺ Eₙ)² = Σₙ₌₁ᴺ sₙ²

• Where E = loan exposure amount (Rs. Cr.) and s = loan share of the total. The HHI is calculated by summing the squares of the portfolio share of each contributor.
• Theoretically, a perfectly diversified portfolio of 500 equal borrowers would have HHI = 0.002. In contrast, if the bank portfolio is divided amongst five zones in the ratio 5:2:1:1:1, then the implied HHI by sector is 0.32, indicating a significant level of concentration.
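A small Python sketch (illustrative) of the HHI computation, using the slide's 5:2:1:1:1 split as the exposure vector:

    import numpy as np

    exposure = np.array([5.0, 2.0, 1.0, 1.0, 1.0])   # zone exposures (Rs. Cr.)
    shares = exposure / exposure.sum()               # s_n = E_n / sum(E)
    hhi = np.sum(shares ** 2)                        # HHI = sum of squared shares
    print(f"HHI = {hhi:.3f}")                        # 0.320 for the 5:2:1:1:1 split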

Gini Coefficient Measure of Inequality


• The Gini coefficient or Lorenz ratio is a standard measure of inequality or concentration of a group distribution.
• It is defined as a ratio with values between 0 and 1. A low Gini coefficient indicates a more equal distribution of income or loan assets across different industries/groups, sectors, etc., while a high Gini coefficient indicates a more unequal distribution.
• For a portfolio of N loans with exposure shares s1, s2, …, sN (sorted in ascending order), the empirical Gini coefficient is defined as

  G(s1, s2, …, sN) = [Σₙ₌₁ᴺ (2n − 1) sₙ] / N − 1

• Equivalently, from the Lorenz curve, the Gini coefficient is G = 1 − Σᵢ pᵢ(zᵢ + zᵢ₋₁), where pᵢ is the probability or frequency share of borrowers in class i and zᵢ is the cumulative loan share.
• A Gini coefficient close to zero (the 45-degree diagonal line: no inequality) corresponds to a well-diversified portfolio, and a value close to one corresponds to a highly concentrated portfolio.
• A Gini coefficient of about 0.3 or less indicates substantial equality; values between 0.3 and 0.4 are generally regarded as acceptable.
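A minimal Python sketch (illustrative) of the empirical Gini formula above, using a hypothetical exposure vector:

    import numpy as np

    exposure = np.array([5.0, 2.0, 1.0, 1.0, 1.0])   # hypothetical exposures
    s = np.sort(exposure / exposure.sum())           # shares, ascending order
    n = np.arange(1, s.size + 1)                     # ranks 1..N
    gini = np.sum((2 * n - 1) * s) / s.size - 1      # G = sum((2n-1)s_n)/N - 1
    print(f"Gini = {gini:.3f}")                      # ~0.36 for this split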
Geographic Loan Concentration: Gini Coefficient Approach

Zone-wise Inequality Comparison in Loan Distribution

[Chart: Lorenz curves for each zone (Central_Z_I, Central_Z_II, East_Z, Mumbai_Z, North_Z, South_Z_I, South_Z_II, West_Z_I, West_Z_II); x-axis: cumulative % of borrowers, y-axis: cumulative % of loan share.]

We have used deciles to slice the zonal loan exposure distribution of a large bank.

Basic concepts in Probability

• Probability is a numerical measure of the likelihood that an event will occur out of all possible outcomes of an experiment.
• The sample space is the set of all experimental outcomes.
• The probability of an event is greater than or equal to 0 and less than or equal to 1.
• The probability of the entire sample space is 1.
• Probabilities under conditions of:
  • Mutual exclusion
  • Mutual non-exclusion
Probability Axioms

• Marginal probability
  • P(A) = relative frequency of occurrence
• Addition law: probability of either of 2 events occurring
  • P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Joint probability: probability of 2 events both occurring
  • P(A ∩ B) = P(A) × P(B) if they are independent
• Conditional probability: probability of an event given that another event has occurred
  • P(A/B) = P(A ∩ B)/P(B)

Few examples

• Tossing an unbiased coin: outcomes H, T; r = 1, s = 1
  • P(A) = s/(s + r), P(B) = r/(s + r)
  • P(A) + P(B) = 1
• Tossing 2 unbiased coins: outcomes TT, TH, HT, HH; r = 2, s = 2
  • P(A = one H & one T, a composite event) = 2/4
  • Prob(both heads): P(A) = 1/4
  • Prob(at least one head) = 3/4
• Similarly, when a die is thrown, there are six possible outcomes: 1, 2, …, 6.
• Find Prob(die giving an even number)
Few examples

• What is the probability that either of two coins gives heads?
  • 1/2 + 1/2 − (1/2 × 1/2) = 3/4 or 0.75
• The probabilities of default of A and B are 0.25% and 0.5% respectively, and the probability that both default is 0.35%. What is the probability that A or B might default?
  • 0.25% + 0.5% − 0.35% = 0.40%, where the two events are not independent.
• What would the probability be if they were independent?

Drawing without replacement


• A loan portfolio contains 8 solvent accounts and 5 defaulted accounts. Two successive draws of 3 accounts are made without replacement. Find the probability that the first drawing gives 3 defaulted and the second 3 solvent facilities.
• Solution: Let A denote the event that the first drawing gives 3 defaulted loans and B the event that the second drawing gives 3 solvent loans. We have to find Prob(A ∩ B).
• By the multiplication theorem of probability, P(A ∩ B) = P(A) × P(B/A)
• P(A) = [8C0 × 5C3]/13C3 = [1 × 10]/286 = 5/143 ≈ 3.50%
• Next, we find P(B/A) = [8C3 × 2C0]/10C3 = [56 × 1]/120 = 7/15 ≈ 46.67%
• Hence, the required probability is P(AB) = (5/143) × (7/15) = 7/429 ≈ 1.63%
• This concept has major applications in op-risk and credit risk portfolio modeling exercises.
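The same calculation can be checked with a few lines of Python (a sketch using the standard-library math.comb):

    from math import comb

    # 8 solvent + 5 defaulted accounts; draw 3, then 3 more, without replacement
    p_a = comb(5, 3) / comb(13, 3)          # first draw: all 3 defaulted
    p_b_given_a = comb(8, 3) / comb(10, 3)  # second draw: all 3 solvent from the 10 left
    p_ab = p_a * p_b_given_a                # multiplication rule
    print(f"P(A)={p_a:.4f}  P(B|A)={p_b_given_a:.4f}  P(A and B)={p_ab:.4f}")  # ~0.0163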
Conditional Probability (without replacement): Example 2

• A box contains five yellow balls and two green balls. What is the probability that three balls randomly taken from the box (without replacement) will all be yellow?
• A = first ball is yellow
• B = second ball is yellow
• C = third ball is yellow
• P(A ∩ B ∩ C) = P(A) P(B/A) P(C/A ∩ B)
• P(A) = 5/7, i.e. 5 yellow balls in a box of 7
• P(B/A) = 4/6, i.e. 4 yellow balls left in a box of 6
• P(C/A ∩ B) = 3/5, i.e. 3 yellow balls left in a box of 5
• Thus: P(A ∩ B ∩ C) = (5 × 4 × 3)/(7 × 6 × 5) = 2/7

Conditional Probability (with replacement): Example

• In a certain repeated experiment, the probability of occurrence of an event is p, and consequently the probability of non-occurrence is 1 − p = q.
• In n repeated trials of the experiment, the probability of occurrence of the event r times is:
• P(r) = nCr pʳ qⁿ⁻ʳ (the Bernoulli or binomial distribution)
• Ex. If 4% of a loan pool are NPAs, determine the probability that out of 4 borrowers chosen at random at most 2 will be defaulting.
• Hint: P(r ≤ 2) = P(r = 0) + P(r = 1) + P(r = 2), where n = 4, p = 4% and q = 96%.
• This concept has major applications in risk loss simulation exercises.
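A quick check of the hint in Python (illustrative; uses scipy.stats.binom):

    from scipy.stats import binom

    n, p = 4, 0.04                      # 4 borrowers, 4% default probability
    p_at_most_2 = binom.cdf(2, n, p)    # P(r <= 2) = P(0) + P(1) + P(2)
    print(f"P(at most 2 defaults) = {p_at_most_2:.5f}")  # ~0.99975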
Conditional Probability: Example1

• From two sets of portfolios A and B, with shares yielding profit as well as loss, what is the probability of picking a profit-yielding share from portfolio A?
• What is the probability of picking a loss-making share given that the share is from portfolio A?

Conditional Probability: Ex2


• Probability of transaction errors = 0.53
• Probability of system failure = 0.50
• Prob(both fail) = 0.27
• Therefore, Prob(Trans_error | System fail) = 0.27/0.50 = 0.54
Conditional Probability: Bayes’ Theorem
• The conditional probability P(Bi/A) of a specified event Bi, when A is stated to have actually occurred, is given by:

  P(Bi/A) = P(Bi) × P(A/Bi) / Σᵢ₌₁ⁿ P(Bi) × P(A/Bi)

• Ex. In a bolt factory, machines A, B and C manufacture respectively 25%, 35% and 40% of the total output. Of their outputs 5, 4 and 2 per cent respectively are defective bolts. A bolt is drawn at random from the product and is found to be defective. What are the respective probabilities that it was manufactured by machine A, B, C?
• Practical application: In a credit scoring model (say a z-score model), once a randomly selected borrower obtains a z-score from its financial ratios, the above theorem helps us to classify him/her with that score. That is, Bayes' theorem tells us into which group the borrower will fall, and with what probability (i.e. whether the customer is of the defaulting or solvent type).
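The bolt-factory example worked out in a short Python sketch (illustrative):

    # prior shares of production and conditional defect rates
    prior = {"A": 0.25, "B": 0.35, "C": 0.40}
    p_def = {"A": 0.05, "B": 0.04, "C": 0.02}

    total_defective = sum(prior[m] * p_def[m] for m in prior)   # P(defective)
    posterior = {m: prior[m] * p_def[m] / total_defective for m in prior}
    print(posterior)  # A: ~0.362, B: ~0.406, C: ~0.232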

Probability Distribution
• In reality, there are an infinite number of possible outcomes for the asset value. We represent the distribution of these possible outcomes with a probability density function (which is linked to the histogram).
• The next figure shows a typical probability density function for credit losses. Along the x-axis is the value of the assets. The height of the function on the y-axis gives the probability of any given loss occurring.
• Higher uncertainty in the asset value increases the probability of defaulting on the debt (for a bond issuer/bank).
Results of 10 Credit-Loss Scenarios

Scenario   Asset Value
1          96.5
2          98.4
3          100.6
4          101.7
5          102.3
6          103.2
7          103.9
8          104.4
9          104.7
10         105.2

Results of 10 possible scenarios for asset values at the end of one year.

Number of Occurrences in Each Range

Range     Occurrences per bin
96-98     1
98-100    1
100-102   2
102-104   3
104-106   3

Table showing the number of results that fall in each range/bin of possible asset values.
Histogram of 10 Credit-Loss Scenarios

[Histogram of Asset_Value, 10 observations. Summary statistics: Mean 102.09, Median 102.75, Maximum 105.2, Minimum 96.5, Std. Dev. 2.8599, Skewness −0.8051, Kurtosis 2.4869, Jarque-Bera 1.1901 (Probability 0.5515).]

Histogram of Credit Loss Scenarios


• The histogram gives us a crude indication of the probability distribution of the asset value. For example, it shows us that there is a 20% chance that the asset value will be less than Rs. 100 (face value).
Probability Density for the Credit-Loss Example

[Probability density plot of Asset_value for the credit-loss example, over the range 96 to 106.]

Cumulative Probabilities
• While the probability density tells us the probability of a variable falling in a given range, the cumulative probability gives the probability of the random variable falling below a given number.
• The cumulative probability can be estimated by multiplying the probability density by the bin width to get probabilities for each bin, and then summing the probabilities for all values less than or equal to the given number (less-than type ogive) or greater than or equal to it (more-than type ogive).
Cumulative Probability for the Credit-Loss Example (less-than type ogive)

[Cumulative probability (less-than type ogive) of Asset_value, rising from 0 to 1 over the range 96 to 106.]

Measure of Relative Location

• Z-scores or standardised values give the number of standard deviations xᵢ lies from the mean:
  zᵢ = (xᵢ − x̄)/sd
• Chebyshev's theorem enables us to estimate the proportion of data values that must lie within a specified number of SDs of the mean:
• At least (1 − 1/z²) of the data values must lie within z SDs of the mean, where z is greater than 1.
Normal Probability Distribution

• Normal probability distribution: a continuous probability distribution whose probability density function is bell-shaped and determined by its mean µ and SD σ.
• The normal pdf is a good model for a continuous random variable whose value depends on a number of factors, each exerting a comparatively small influence.
• The normal pdf is symmetric around the mean/median/mode.
• The probability of obtaining a value far from the mean becomes progressively smaller.
• 68.26%, 95%, 95.44%, 99% and 99.73% of the area is covered by 1, 1.96, 2, 2.58 and 3 SDs respectively.

Normal Distribution
If we measured a randomly distributed characteristic very accurately in a very large sample of cases, we would obtain a frequency distribution that is symmetric and in which most cases cluster around the mean.
Standard Normal Distribution
• Standard normal distribution: a normal probability distribution with mean 0 and SD 1.
• Normal distributions differ from one another in terms of mean and SD.
• Comparison of 2 normal distributions is possible through standardization.
• A new variable Z may be created from a normal distribution, with mean = 0 and SD = 1, where
  Z = (Xᵢ − X̄)/SD
• The standard normal distribution can be used to compute various confidence intervals of probable price/loss/return ranges.
• Most VaR models used in calculating economic capital assume the loss distribution follows a standard normal distribution. Many statistical credit scoring models also assume the error term follows a standard normal distribution.

Examples

• Suppose the daily change in price of a security follows the normal distribution with a mean of 70 bps and a variance of 9. What is the probability that on any given day the change in price is greater than 75 bps?
  • Z = (75 − 70)/3 = 1.67
  • P(X > 75) = P(Z > 1.67)
  • = 1 − P(Z < 1.67) = 1 − 0.9525 = 0.0475
• Now estimate:
  • The probability of the change in price being 75 bps or less
  • The probability of the change in price being between 65 and 75 bps
  • The probability of the change in price being less than or equal to 60 bps
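These probabilities can be verified with scipy.stats.norm (a sketch; the follow-up exercises are included as comments):

    from scipy.stats import norm

    mu, sigma = 70, 3                      # mean 70 bps, variance 9 => SD 3
    print(1 - norm.cdf(75, mu, sigma))     # P(X > 75)        ~0.0478
    print(norm.cdf(75, mu, sigma))         # P(X <= 75)       ~0.9522
    print(norm.cdf(75, mu, sigma) - norm.cdf(65, mu, sigma))  # P(65<X<75) ~0.9044
    print(norm.cdf(60, mu, sigma))         # P(X <= 60)       ~0.0004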
Confidence Intervals for Standard Normal Distribution

• A normal distribution with mean 0 and standard deviation 1 is called a standard normal distribution.
• In risk management, confidence levels are often more useful than confidence intervals, because we are usually concerned with the downside risk or worst-case level (tail risk).
• The confidence level is a single number that will not be exceeded, with a given probability (%).
• For example, there is only a 5% chance that a variable drawn from a standard normal distribution will have a value greater than 1.64.
• We can therefore say that the 95% confidence level for this variable is 1.64. The inverse of the confidence level, α, corresponds to the tail percentile.

Confidence Interval…Example

• Suppose the mean operational loss X̄ = $434,045 and we set the confidence multiplier α = 5%, so that we have a (1 − α) = 95% confidence interval around the estimate of the mean. Such an interval can be calculated using:

  X̄ ± z_α × Stdev(X̄)

• Stdev(X̄), the standard deviation of X̄, is $73,812, and z is the standard normal variable for α = 5%. Using the Normsinv() function, we see that Normsinv(0.95) = 1.64 (or see the standard normal table). Therefore, we can set z = 1.64 and calculate the 95% confidence interval as $312,635 to $555,455.
• In this case, the OR manager may feel comfortable stating the average OR loss as $434,045, with 95% confidence that the actual (population) value lies somewhere close to this value, say, between $312,635 and $555,455.
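The same interval in Python (a sketch; scipy.stats.norm.ppf plays the role of Excel's Normsinv):

    from scipy.stats import norm

    mean, stdev, alpha = 434_045, 73_812, 0.05
    z = norm.ppf(1 - alpha)                    # ~1.6449
    lo, hi = mean - z * stdev, mean + z * stdev
    print(f"z={z:.4f}  95% CI: ${lo:,.0f} to ${hi:,.0f}")  # matches the slide's figures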
Confidence Interval for Calculating Average Defaults

• A bank finds that the survival life of its defaulted housing loan facilities is approximately normally distributed, with mean equal to 600 days and a standard deviation of 40 days. Find the probability that a random sample of 16 defaulted loans will have an average life of less than 550 days.
• Here, the mean is 600 and the standard error is SD = 40/√16 = 10. The desired probability is given by the area of the shaded region in the figure (not reproduced here).
• Corresponding to 550, we find Z = (550 − 600)/10 = −5 and therefore Pr(X̄ < 550) = Pr(Z < −5) = normsdist(−5) = 0.00003%.

Confidence Interval & Precision of Prediction

• A confidence interval is an interval constructed from a sample which includes the parameter being estimated with a specified probability, known as the confidence level.
• If a risk indicator, for example, were sampled on many occasions and the confidence interval calculated each time, then 100×(1 − α)% of such intervals would cover the true population parameter being estimated. Therefore, the width of the confidence interval measures how uncertain we are about the unknown population parameter.
• A very narrow interval indicates less uncertainty (or a lower error rate) about the value of the population parameter than a very wide interval.
• It is important to note that since a C.I. is a function of a sample, it is itself a random variable and will therefore vary from sample to sample.
48
Example: Credit Risk: Bond Default Rates over 19 Years

Year   Bond Default Rate (bp)
1982   125
1983   68
1984   84
1985   99
1986   175
1987   93
1988   146
1989   151
1990   256
1991   297
1992   121
1993   47
1994   52
1995   91
1996   43
1997   52
1998   116
1999   198
2000   212

Source: S&P's Credit Week, Jan 31, 2001.

Histogram of Bond Default Losses

[Histogram of Loss_Rate_bsp, 19 observations. Summary statistics: Mean 127.6842, Median 116, Maximum 297, Minimum 43, Std. Dev. 72.4462, Skewness 0.8446, Kurtosis 2.8805, Jarque-Bera 2.2702 (Probability 0.3214).]
Descriptive Statistics of Credit Loss

[Histogram of Asset_Value, 10 observations, as shown earlier. Summary statistics: Mean 102.09, Median 102.75, Maximum 105.2, Minimum 96.5, Std. Dev. 2.8599, Skewness −0.8051, Kurtosis 2.4869, Jarque-Bera 1.1901 (Probability 0.5515).]

Bank's Loan Loss Distribution

[Histogram of HIST_LGD, sample 1-829, 829 observations. Summary statistics: Mean 0.7519, Median 0.9372, Maximum 1.0000, Minimum 0.0000, Std. Dev. 0.3232, Skewness −1.1604, Kurtosis 3.0635, Jarque-Bera 186.1932 (Probability 0.0000).]
Fitting Beta Distribution to Loan Loss

[Fitted BetaGeneral(0.35405, 0.15230, 0, 1) density overlaid on the loan-loss data; 90% of the fitted mass lies between 0.005 and 1.000.]

Fit-test vs input data: Mean 0.6992 vs 0.7519; Median 0.9308 vs 0.9372; Std. Dev. 0.3737 vs 0.3232; Variance 0.1396 vs 0.1044; Skewness −0.8509 vs −1.1604; Kurtosis 2.0652 vs 3.0635.

Fitted Loss Distribution through Simulation

[Fitted BetaGeneral(0.34846, 0.16739, 0, 1) density from simulation; 90% of the fitted mass lies between 0.004 and 1.000.]

Fitted vs input: Mean 0.6755 vs 0.6994; Median 0.8974 vs 0.93; Std. Dev. 0.3803 vs 0.3740; Variance 0.1446 vs 0.1397; Skewness −0.7338 vs −0.8509; Kurtosis 1.8714 vs 2.0657.
Market Risk Example: Histogram of Daily Returns for S&P CNX NIFTY over a 5-Year Period

[Histogram of SNP_RETURN, sample 1-1275, 1275 observations. Summary statistics: Mean 0.001205, Median 0.002188, Maximum 0.079691, Minimum −0.130539, Std. Dev. 0.014263, Skewness −1.0885, Kurtosis 11.3511, Jarque-Bera 3956.755 (Probability 0.0000).]

Hypothesis Testing
• Testing of hypotheses is one of the main objectives of sampling theory. Hypothesis tests address the uncertainty of the sample estimate.
• When we have to make a decision about the entire population based on sample data, hypothesis tests help us arrive at that decision.
• A test attempts to refute a specific claim about a population parameter based on the sample data.
• The process which enables us to decide, on the basis of sample results, whether a hypothesis is true or not is called a test of hypothesis or test of significance.
Hypothesis Testing Procedure
• All hypothesis tests are conducted the same way. The researcher states a hypothesis to be tested, formulates an analysis plan, analyzes sample data according to the plan, and accepts or rejects the null hypothesis based on the results of the analysis.
• State the hypotheses. Every hypothesis test requires the analyst to state a null hypothesis and an alternative hypothesis. The hypotheses are stated in such a way that they are mutually exclusive: if one is true, the other must be false, and vice versa.
• Formulate an analysis plan. The analysis plan describes how to use sample data to accept or reject the null hypothesis. It should specify the following elements.
  • Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10, but any value between 0 and 1 can be used.

One-Tailed vs. Two-Tailed Hypothesis Testing

• One-tailed test
  • A test of a statistical hypothesis where the region of rejection is on only one side of the sampling distribution is called a one-tailed test. In such tests, we are only interested in values greater (or less) than the null. A one-sided hypothesis test is as follows:
  • Test H0: k = 0 against HA: k > 0 (or k < 0), and we reject the null if Tcomp > Tcritical (or Tcomp < −Tcritical).
• Two-tailed test
  • A test of a statistical hypothesis where the region of rejection is on both sides of the sampling distribution is called a two-tailed test. In such tests, we are interested in values both greater and smaller than the null hypothesis.
  • We write this as: Test H0: k = 0 against HA: k ≠ 0, and we reject the null if |Tcomp| > Tcritical.
  • In the two-sided hypothesis test, we calculate the critical value using α/2. For example, with α = 5%, the critical value of the test statistic is T0.025.
Problem 1: Two-Tailed Test

• An inventor has developed a new, energy-efficient lawn mower engine. He claims that the engine will run continuously for 5 hours (300 minutes) on a single gallon of regular gasoline. Suppose a simple random sample of 50 engines is tested. The engines run for an average of 295 minutes, with a standard deviation of 20 minutes. Test the null hypothesis that the mean run time is 300 minutes against the alternative hypothesis that the mean run time is not 300 minutes. Use a 0.05 level of significance. (Assume that run times for the population of engines are normally distributed.)
• Null hypothesis: µ = 300
  Alternative hypothesis: µ ≠ 300
• Note that the null hypothesis will be rejected if the sample mean is too big or too small.

Solution 1: Two-Tailed Test

• Analyze sample data. Using sample data, we compute the standard error (SE), degrees of freedom (DF), and the t-score test statistic (t).
• SE = s / sqrt(n) = 20 / sqrt(50) = 20/7.07 = 2.83
  DF = n − 1 = 50 − 1 = 49
  t = (x̄ − µ) / SE = (295 − 300)/2.83 = −1.77
• where s is the standard deviation of the sample, x̄ is the sample mean, µ is the hypothesized population mean, and n is the sample size.
• Since we have a two-tailed test, the P-value is the probability that a t-score having 49 degrees of freedom is less than −1.77 or greater than 1.77.
• We use the t-distribution calculator to find P(t < −1.77) = 0.04 and P(t > 1.77) = 0.04. Thus, the P-value = 0.04 + 0.04 = 0.08.
• Interpret results. Since the P-value (0.08) is greater than the significance level (0.05), we cannot reject the null hypothesis.
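The same test in Python (a sketch using scipy.stats, computed from the summary statistics rather than raw data):

    import numpy as np
    from scipy.stats import t

    n, xbar, s, mu0 = 50, 295, 20, 300
    se = s / np.sqrt(n)                           # standard error
    t_stat = (xbar - mu0) / se                    # ~ -1.77
    p_value = 2 * t.cdf(-abs(t_stat), df=n - 1)   # two-tailed P-value
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}") # p ~ 0.08 > 0.05: fail to reject H0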
Problem2: One-tailed test
„ Bon Air Elementary School has 300 students. The principal of
the school thinks that the average IQ of students at Bon Air is
at least 110. To prove her point, she administers an IQ test to
20 randomly selected students. Among the sampled students,
the average IQ is 108 with a standard deviation of 10. Based on
these results, should the principal accept or reject her original
hypothesis? Assume a significance level of 0.01.
„ Null hypothesis: µ = 110
Alternative hypothesis: µ < 110
„ Note that these hypotheses constitute a one-tailed test. The null
hypothesis will be rejected if the sample mean is too small.

61

Solution 2: One-Tailed Test

• Analyze sample data. Using sample data, we compute the standard error (SE), degrees of freedom (DF), and the t-score test statistic (t).
• SE = s / sqrt(n) = 10 / sqrt(20) = 10/4.472 = 2.236
  DF = n − 1 = 20 − 1 = 19
  t = (x̄ − µ) / SE = (108 − 110)/2.236 = −0.894
• where s is the standard deviation of the sample, x̄ is the sample mean, µ is the hypothesized population mean, and n is the sample size.
• Since we have a one-tailed test, the P-value is the probability that a t-score having 19 degrees of freedom is less than −0.894.
• We use the t-distribution calculator to find P(t < −0.894) = 0.19. Thus, the P-value is 0.19.
• Interpret results. Since the P-value (0.19) is greater than the significance level (0.01), we cannot reject the null hypothesis.
Hypothesis Testing: Bond Loss Example 1
Hypothesis Testing for LOSS_RATE_BSP
Date: 10/24/07  Time: 12:50
Sample: 1 19; Included observations: 19

Test of Hypothesis: Mean = 128.0000
Assuming Std. Dev. = 72.44619
Sample Mean = 127.6842
Sample Std. Dev. = 72.44619

Method        Value     Probability
Z-statistic   -0.019    0.9848
t-statistic   -0.019    0.9850

Example 2: Bond Loss

Hypothesis Testing for LOSS_RATE_BSP
Date: 10/24/07  Time: 13:03
Sample: 1 19; Included observations: 19

Test of Hypothesis: Mean = 80.00000
Sample Mean = 127.6842
Sample Std. Dev. = 72.44619

Method        Value      Probability
t-statistic   2.869035   0.0102
Parametric-Mean Difference Test
• Many problems arise where we wish to test hypotheses about the means of two different populations (e.g. comparing ratios of defaulted and solvent firms, or comparing the performance of public sector banks vis-à-vis private banks, etc.).
• Unpaired test: start by assuming H0 (equal means) is true and use the two-sample test statistic to arrive at a decision; in its standard form, t = (X̄1 − X̄2)/√(s1²/n1 + s2²/n2).
• A low p-value (<0.05) will reject the null, and a high p-value (>0.10) will fail to reject the null.

Ex: Difference between Solvent & Defaulted Groups of Borrowers

Variable name                 Solvent mean   Defaulted mean   Difference (t-test)$
PROPERTY AREA (SQ. METER)     101.67         65.99            35.68** (23.81)
GROSS MONTHLY INCOME (RS.)    20,443.30      9,711.90         10,731.40** (28.56)
AGE_BORR                      43             45               -1.79** (-12.42)
NO_DEPEND                     1.445          1.744            -0.2988** (-12.57)
LN_ASSTVAL                    12.75          11.95            0.798** (22.15)
SECVAL_LOANAMT                1.65           1.50             0.15** (10.05)
NO_CO_BORR                    0.48           0.31             0.174** (18.70)
COBOR_MINC                    3,061.04       1,024.64         2,036.4** (12.30)
ORGNL_TERMM                   173.26         176.4            -3.14** (-3.8)
No. of observations           7,321          6,166

$ The last column shows the mean difference, with t-statistics in parentheses.
Errors of Testing
• There are two kinds of errors that can be made in significance testing: (1) a true null hypothesis can be incorrectly rejected, and (2) a false null hypothesis can fail to be rejected.
• The former error is called a Type I error and the latter a Type II error.

                          True state of the null hypothesis
Statistical decision      H0 true          H0 false
Reject H0                 Type I error     Correct
Do not reject H0          Correct          Type II error

• The probability of a Type I error is designated by the Greek letter alpha (α) and is called the Type I error rate; the probability of a Type II error (the Type II error rate) is designated by the Greek letter beta (β).

Relationship Between Alpha, Beta and Power

[Figure illustrating the relationship between α, β and the power of a test.]
Example: Classification Power of a Statistical Scoring Model

Table: Classification power of the Logistic Model 1 for the holdout sample of the years 2003 & 2004

                     Predicted group
Original group       Defaulted     Solvent      Total
Defaulted            47 (94%)      3 (6%)       50 (100%)
Solvent              8 (16%)       42 (84%)     50 (100%)

Note: figures in parentheses denote percentages.

Testing The Power of Credit Risk Models

                                                          % correct classification (within sample)
Model                                                     Good       Bad
Altman Z-Score 1968, reworked with Indian data            84.00%     82.00%
Emerging Market Z-Score 1995, reworked with Indian data   88.20%     75.90%
NIBM Z-Score 2005, developed on Indian data               85.20%     91.00%
Calibrating & Benchmarking A Model

• Take a look at the two graphs (not reproduced here) showing the score-wise distribution of bankrupt and non-bankrupt categories of borrowers.
• The first graph has substantial overlap of observations, making it difficult to predict failure for a large number of borrowers, while the second graph has much less overlapping area between the two categories.

Types of Probability Distributions

• Discrete (for event/frequency prediction) & continuous (for losses/severity)
• Discrete: binomial, Poisson, Bernoulli, negative binomial, …
• Continuous: normal, beta (credit risk, market risk), t, chi-squared, exponential, Weibull (extreme distribution, thick tail), …
Popular Discrete Distributions: Rule of
Thumb for Identifying Them
• Binomial distribution, Poisson distribution and negative binomial distribution
• A useful rule of thumb for choosing between these popular distributions is:

  Binomial: variance < arithmetic mean
  Poisson: variance = arithmetic mean
  Negative binomial: variance > arithmetic mean

• Thus, if we observe that our sample variance is much larger than the sample mean, the negative binomial distribution may be an appropriate choice.

Frequency Distributions

Number of frauds per month (Jan-Aug): 95, 82, 114, 74, 79, 160, 110, 115

Poisson distribution fitted with λ = 102; cumulative distribution function:

  F(x) = Σₖ₌₀ˣ e^(−λ) λᵏ / k!

Estimated quantiles of the fitted Poisson: 91% ≈ 115, 95% ≈ 118, 99% ≈ 126.

[Charts: Poisson PDF and CDF for λ = 102.]

Other popular distributions to estimate frequency are the geometric, negative binomial, binomial, Weibull, etc.
Binomial Distribution

N = 12, p = 0.8

  f(x) = N! / [x!(N − x)!] × pˣ (1 − p)^(N−x)

Mean = Np, and standard deviation: σ = √(Np(1 − p))

[Chart: binomial probabilities against the number of events.]

The parameter p can be estimated by p̂ = x/N.

Summary of Frequency of Loss: Daily Data for Credit Card Fraud

No. of events per day (i)   Observed frauds (ni)   i × ni
0                           19                     0
1                           16                     16
2                           51                     102
3                           9                      27
4                           6                      24
5                           5                      25
6                           4                      24
7                           6                      42
8                           2                      16
9                           1                      9
10                          0                      0
11                          0                      0
12                          2                      24
13                          1                      13
14                          0                      0
15                          2                      30
Total                       124                    352

Poisson distribution: F(x) = Σₖ₌₀ˣ e^(−λ) λᵏ / k!, where e = 2.71828… and x = 0, 1, 2, …

Here, mean (lambda) λ = Σ(i × ni)/Σni = 352/124 = 2.84, and SD = √2.84 = 1.685.
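A short Python sketch (illustrative) estimating λ from the observed frequency table and producing fitted Poisson probabilities:

    import numpy as np
    from scipy.stats import poisson

    events = np.arange(16)                            # 0..15 events per day
    observed = np.array([19, 16, 51, 9, 6, 5, 4, 6, 2, 1, 0, 0, 2, 1, 0, 2])

    lam = (events * observed).sum() / observed.sum()  # 352/124 = 2.84
    fitted = poisson.pmf(events, lam)                 # fitted probabilities
    print(f"lambda = {lam:.2f}, SD = {np.sqrt(lam):.3f}")
    for i, (o, f) in enumerate(zip(observed, fitted)):
        print(f"{i:2d}  observed={o:2d}  fitted={f:6.2%}")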
Distribution of Credit Card Fraud Events

[Bar chart: observed frauds (y-axis, 0-60) against the number of events per day (x-axis, 0-15).]

Fitted Poisson Values for Credit Card Frauds (λ = 2.84)

No. of events   Fitted probability
0               5.84%
1               16.59%
2               23.56%
3               22.31%
4               15.84%
5               9.00%
6               4.26%
7               1.73%
8               0.61%
9               0.19%
10              0.05%
11              0.01%
12 and above    ~0.00%

[Chart: fitted Poisson probabilities against the number of events.]
Chi-Sq. Goodness of Fit Test
• The risk manager should run a fit test to confirm the right selection of distribution.
• One such test is the chi-squared goodness-of-fit test. The test statistic is calculated by dividing the data into n bins (or ranges) and is defined as:

  T̃ = Σᵢ₌₁ⁿ (Oᵢ − Eᵢ)² / Eᵢ

• H0: the data follow a specified distribution (here, Poisson)
• Ha: the data do not follow the specified distribution
• Where Oᵢ is the observed number of events, Eᵢ is the expected (fitted) number of events, and n is the number of categories.
• d.f. = n − k − 1, where k refers to the number of parameters that need to be estimated.

Example: Key Personnel Risk


No. of Back Office Staff Leaving per Month

Number leaving per month (i)   Observed (ni)   i × ni
0                              18              0
1                              20              20
2                              21              42
3                              11              33
4                              4               16
5                              1               5
Total                          75              116

Mean (lambda) = 116/75 ≈ 1.55

• We fit a Poisson distribution to the above data. The parameter λ is estimated at 1.55 (one can check).
• This tells us that there has been a constant turnover in staff of between one and two people per month.
Actual vs Poisson-Fitted Values & Back Office Turnover Risk

[Chart: observed frequencies vs Poisson-fitted frequencies of staff leaving per month (0-5).]

The Poisson distribution appears visually to fit the data fairly well.

Chi-Squared Goodness-of-Fit Result

Observed & expected number of back office staff leaving per month

Numbers leaving per month (i)   Observed (O)   Fitted prob.   Expected (E)   (O−E)²/E
0                               18             21.23%         16             0.2721
1                               20             32.90%         25             0.8852
2                               21             25.50%         19             0.1844
3                               11             13.17%         10             0.1270
4                               4              5.11%          4              0.0077
5                               1              1.58%          1              0.0293

Chi-squared test statistic T̃ = 1.51; critical value at 5% with 5 d.f. = 11.07.

Since the chi-squared test statistic T̃ = 1.51 is less than the critical value of 11.07 at 5 percent significance with 5 degrees of freedom (n − 1 = 6 − 1 = 5), we fail to reject the null hypothesis and conclude that there is no evidence that the observed distribution differs significantly from the expected (Poisson) distribution. [In Excel, use the CHIINV(p, df) formula to obtain the critical value.] Hence, the Poisson distribution fits the data fairly well.
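The test can be reproduced in Python (a sketch; the fitted probabilities are rescaled to the observed total so that observed and expected counts match):

    import numpy as np
    from scipy.stats import poisson, chi2

    observed = np.array([18, 20, 21, 11, 4, 1])
    lam = 1.55
    fitted_prob = poisson.pmf(np.arange(6), lam)
    expected = fitted_prob / fitted_prob.sum() * observed.sum()  # rescale to n=75

    t_stat = np.sum((observed - expected) ** 2 / expected)       # chi-squared statistic
    crit = chi2.ppf(0.95, df=5)                                  # ~11.07
    print(f"T = {t_stat:.2f}, critical value = {crit:.2f}")      # T < crit: fail to reject H0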
Testing the Fitness of Continuous
Distributions

• Jarque-Bera statistic: tests the normality of a distribution
• Kolmogorov-Smirnov test: identifies the fat tails
• Anderson-Darling test: best fits extreme distributions
• Schwarz criterion
• Akaike information criterion
• Graphical: quantile-quantile (Q-Q) plots, probability-probability (P-P) plots

Confidence Interval for Loss Prediction

• Using Chebyshev's theorem we can say that at least 3/4 (or 75%) of loss events (credit card frauds) fall in the interval λ ± 2σ.
• For credit card events (in our example, credit card fraud), 75% of the observations will lie between 0 and 6, since the interval is λ ± 2σ = 2.84 ± 2 × 1.685.
• Similarly, in the case of key personnel risk, 75% of the observations will lie between 0 and 3.
• A similar concept applies in choosing a sample from the population for making statistical inference.
Kolmogorov-Smirnov Test (K-S)
• The Kolmogorov-Smirnov goodness-of-fit test checks whether a set of data comes from a hypothesized continuous distribution.
• It tends to be more sensitive near the center of the distribution than at the tails.
• H0: the data follow the specified distribution. Ha: the data do not follow the specified distribution.
• Test statistic (in its standard form): D = max over i of |F(Yᵢ) − i/N|,
  • where F(Y) is the theoretical fitted distribution, and
  • i/N is the empirical (actual) data distribution.
• The hypothesis regarding the distributional form is rejected if the test statistic, D, is greater than the critical value obtained from a table.
• You can run this test in BestFit, EasyFit or DataPlot software.
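A minimal Python sketch (illustrative) of a K-S test against a fitted normal distribution using scipy.stats.kstest; the severity sample is simulated:

    import numpy as np
    from scipy.stats import kstest, norm

    rng = np.random.default_rng(42)
    losses = rng.lognormal(mean=11.5, sigma=1.0, size=140)   # hypothetical severity data

    # K-S test of the (clearly skewed) sample against a normal fit
    # (strictly, estimating parameters from the same data biases the p-value)
    mu, sigma = losses.mean(), losses.std()
    d_stat, p_value = kstest(losses, norm(mu, sigma).cdf)
    print(f"D = {d_stat:.3f}, p = {p_value:.4f}")   # small p => reject normality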

Anderson-Darling Test
• The Anderson-Darling goodness-of-fit test checks whether a data set comes from a specified distribution.
• It is a modification of the Kolmogorov-Smirnov (K-S) test and gives more weight to the tails than the K-S test.
• The K-S test is distribution-free in the sense that its critical values do not depend on the specific distribution being tested.
• The Anderson-Darling test makes use of the specific distribution in calculating critical values. This has the advantage of allowing a more sensitive test, and the disadvantage that critical values must be calculated for each distribution.
• You can run this test in BestFit, EasyFit or DataPlot software.
• More formally, the test is defined as follows:
  • H0: the data follow a specified distribution.
  • Ha: the data do not follow the specified distribution.
• For the test statistic, see a statistics textbook.
Severity Distribution: Legal Liability Loss

Descriptive statistics of legal liability losses (in British pounds):
Mean 151,944; Median 103,522.9; Standard deviation 170,767.1; Skew 2.8064; Kurtosis 15.3145; No. of obs. 140.

[Histogram of legl_loss: percent frequency against loss size, 0 to 1,500,000.]

Normal Probability Plot for Legal Event Losses

[P-P and Q-Q plots of the fitted Normal(151944, 170767) against the legal-loss data; fitted p-values/quantiles plotted against input p-values/quantiles (values in millions).]
Exponential Probability Plot for Legal Event Losses

[P-P and Q-Q plots of the fitted Expon(149190), shift = +1688.6, against the legal-loss data (values in millions).]

Fitted Exponential Distribution

Expon(149190), shift = +1688.6. Fitted vs actual: Minimum 1,688.6 vs 2,754.2; Mean 150,878 vs 151,944; Median 105,099 vs 103,523; Std. Dev. 149,190 vs 170,767; Skewness 2 vs 2.8064; Kurtosis 9 vs 15.3145.

[Density plot: 90% of the fitted mass lies between 0.009 and 0.449 (values in millions).]
Fitted Weibull Distribution to Cover the Fat Tail

Weibull(1.2154, 192107), shift = −26,732. Fitted vs actual: Minimum −26,732 vs 2,754.2; Mean 153,392 vs 151,944; Median 115,363 vs 103,523; Std. Dev. 148,922 vs 170,767; Skewness 1.492 vs 2.8064; Kurtosis 6.0945 vs 15.3145.

[Density plot: 90% of the fitted mass lies between −0.010 and 0.447 (values in millions).]

Weibull Probability Plot for Legal Event Losses

[P-P and Q-Q plots of the fitted Weibull(1.2154, 192107), shift = −26,732, against the legal-loss data (values in millions).]
Fitting Beta Distribution to Loan Loss

[Fitted BetaGeneral(0.35405, 0.15230, 0, 1) density, as shown earlier; 90% of the fitted mass lies between 0.005 and 1.000. Fit-test vs input: Mean 0.6992 vs 0.7519; Median 0.9308 vs 0.9372; Std. Dev. 0.3737 vs 0.3232; Skewness −0.8509 vs −1.1604; Kurtosis 2.0652 vs 3.0635.]

Beta Distribution

The mean and standard deviation of the beta distribution are given by:

  Mean = α/(α + β)  and  S.D. = √[ αβ / ((α + β)² (α + β + 1)) ]

The parameters of this distribution can be easily estimated using the following (method of moments) equations:

  α̂ = X̄ [ (X̄(1 − X̄)/S²) − 1 ]   and   β̂ = (1 − X̄) [ (X̄(1 − X̄)/S²) − 1 ]
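A small Python sketch (illustrative) of these method-of-moments estimators applied to a hypothetical LGD sample:

    import numpy as np

    lgd = np.array([0.95, 0.10, 1.00, 0.85, 0.40, 0.99, 0.05, 0.90, 0.75, 0.60])  # hypothetical

    xbar, s2 = lgd.mean(), lgd.var()
    common = xbar * (1 - xbar) / s2 - 1          # shared factor in both estimators
    alpha_hat = xbar * common                    # method-of-moments alpha
    beta_hat = (1 - xbar) * common               # method-of-moments beta
    print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")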
Fitted Loss Distribution through Simulation

[Fitted BetaGeneral(0.34846, 0.16739, 0, 1) density from simulation, as shown earlier; 90% of the fitted mass lies between 0.004 and 1.000. Fitted vs input: Mean 0.6755 vs 0.6994; Median 0.8974 vs 0.93; Std. Dev. 0.3803 vs 0.3740; Skewness −0.7338 vs −0.8509; Kurtosis 1.8714 vs 2.0657.]

VaR

• Value-at-Risk is a risk measure (risk capital) which is generically defined as the maximum possible loss for a given position or portfolio within a known confidence interval over a specific time horizon, due to a certain kind of risk.
Correlation and Dependence Analysis
• Frequency-based joint dependence: using probability and set theory; random sampling with or without replacement.
• Pearson correlation coefficient: r(x,y) = Cov(x,y)/(SDx × SDy)
• Spearman's rank correlation coefficient (ρ): for example, the correlation between salary ratio and gross income generation for 20 traders.
• ρ = 1 − 6Σdᵢ²/[n(n² − 1)], where dᵢ are the differences of the ranked pairs.
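Both coefficients in Python (a sketch with hypothetical data for the 20 traders):

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    rng = np.random.default_rng(7)
    salary_ratio = rng.uniform(0.5, 2.0, size=20)                    # hypothetical
    gross_income = 3.0 * salary_ratio + rng.normal(0, 0.5, size=20)  # noisy linear link

    r, _ = pearsonr(salary_ratio, gross_income)     # linear correlation
    rho, _ = spearmanr(salary_ratio, gross_income)  # rank correlation
    print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")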

Econometric Models

Regression model

Simple Linear Regression (OLS)

• Regression analysis is concerned with the study of the relationship between one variable, called the explained or dependent variable, and one or more other variables, called independent or explanatory variables.
• Linearity implies the slope remains constant.
• In an SLR model, y = b0 + b1x + e
• b0 and b1 are the parameters, and e is a random variable referred to as the error term. 'e' accounts for variability in y not accounted for by the linear relationship between x and y.
• b0 is the intercept or constant term; b1 is the slope, indicating the change in y for a 1-unit change in x.
• The sign of b1 indicates the direction of the relationship.

Assumptions of OLS regression

• The OLS method of estimation minimises the sum of the squared error terms.
• Assumptions behind OLS:
  • The model is correctly specified
  • Mean value of uᵢ = 0
  • Equal variance of uᵢ (homoscedasticity)
  • No autocorrelation, i.e. no correlation between uᵢ and uⱼ
  • Zero covariance between Xᵢ and uᵢ
  • No multicollinearity, Cov(Xᵢ, Xⱼ) = 0 (for multivariate regression)
Regression Analysis -- OLS

[Scatter plot of Y against X with a fitted straight line.]

Ordinary Least Squares (OLS)
• We have a set of data points, and want to fit a line to the data.
• The most "efficient" estimator can be shown to be OLS; it minimizes the squared distance between the line and the actual data points.

Regression Analysis -- OLS

  Yⱼ = a + b·Xⱼ + εⱼ                                  (the basic equation)

  b̂ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²                  (OLS estimator of b)

  â = Ȳ − b̂X̄                                          (OLS estimator of a)

Here a hat denotes an estimator, and a bar a sample mean.
Regression Analysis -- Confidence

  R² = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)² = (correlation)²

  S.E.(b̂) = √[ (Σ(Yᵢ − Ŷᵢ)² / (n − 2)) / Σ(Xᵢ − X̄)² ] = √[ (RSS/(n − 2)) / Σ(Xᵢ − X̄)² ]

  S.E.(â) = √Variance(â), where Variance(â) = (RSS/(n − 2)) × [ 1/n + X̄²/Σ(Xᵢ − X̄)² ]

Here, the R-squared is a measure of the goodness of fit of our model, while the standard error of b gives us a measure of confidence for our estimate of b.
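A compact NumPy sketch (illustrative, hypothetical data) computing these OLS quantities directly from the formulas above:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 40, size=30)            # hypothetical regressor
    y = 5.0 + 0.3 * x + rng.normal(0, 2, 30)   # hypothetical response

    xbar, ybar = x.mean(), y.mean()
    b_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    a_hat = ybar - b_hat * xbar

    y_hat = a_hat + b_hat * x
    rss = np.sum((y - y_hat) ** 2)
    r2 = np.sum((y_hat - ybar) ** 2) / np.sum((y - ybar) ** 2)
    se_b = np.sqrt(rss / (len(x) - 2) / np.sum((x - xbar) ** 2))
    print(f"a={a_hat:.3f} b={b_hat:.3f} R2={r2:.3f} SE(b)={se_b:.4f} t={b_hat/se_b:.2f}")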

Significance Test for Regression Coefficients

• The t-test in a regression exercise helps us examine whether the regression parameters have any significant effect on the dependent variable.
• For example, a researcher may want to examine the effect of system downtime on the amount of operational risk, or of a fall in GDP growth on the default rate.
• H0: b = 0; a = 0
• Ha: b ≠ 0; a ≠ 0
• Test statistics:

  t(â) = (â − 0)/S.E.(â)  and  t(b̂) = (b̂ − 0)/S.E.(b̂)

A 100×(1 − α)% confidence interval (C.I.) is given by: b̂ ± S.E.(b̂) × t(α/2), where t follows the t-distribution with d.f. n − 2.

Rule of thumb: if the estimated |t| exceeds the critical value and the resulting error rate or p-value is < 0.05, then we reject the null hypothesis and conclude that the independent variable (X) significantly affects the dependent variable (Y). One can also perform a one-tailed test to check whether there is a positive or negative relationship between Y and X.
Overall Significance and Goodness of Fit

  Total sample variance: TSS = ESS + RSS
  ⇒ Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = Σ(Ŷᵢ − Ȳ)² + Σ(Yᵢ − Ŷᵢ)²

Again, it can be proved that:

  ESS = b̂² × Σ(Xᵢ − X̄)²

• The difference between TSS and RSS represents the improvement obtained by adjusting Y to account for X.
• The measure of goodness of fit R² can be constructed as the ratio of explained variance to total variance, i.e. R² = ESS/TSS = 1 − RSS/TSS.
• For a good-fitting model, ESS will be large, RSS will be small and R² will be large.

Regression Analysis -- Two-Variable

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9768
R Square            0.9541
Adjusted R Square   0.9449
Standard Error      27.0865
Observations        7

ANOVA
             df   SS           MS         F          Significance F
Regression   1    76274.477    76274.48   103.9621   0.000155729
Residual     5    3668.380     733.676
Total        6    79942.857

             Coefficients   Standard Error   t Stat     P-value
Intercept    87.1061        17.9260          4.859204   0.004636
Q (X)        12.2123        1.1977           10.19618   0.000156

b̂ / S.E.(b̂) = the t-ratio. Combined with critical values from a Student's t distribution, this ratio tells us how confident we are that a value is significantly different from zero.
Regression Fit

[Chart: Q(X) line fit plot showing actual TC(X) and predicted TC(X) against Q(X).]

Interpretation of Model Parameters

Operational Losses and System Downtime

Date     Operational losses ($)   System downtime (minutes)
1-Jun    1,610,371                9
2-Jun    25,677                   0
3-Jun    1,504,852                11
4-Jun    (missing)
5-Jun    913,881                  7
6-Jun    2,352,458                18
7-Jun    3,549,325                19
8-Jun    0                        0
9-Jun    0                        0
10-Jun   1,649,917                13

Using the above data, we estimate the regression equation using OLS:

  Operational loss = −$40,526 + [$155,470 × system downtime];  R² = 0.9314
  (standard errors: 176,688.8 and 15,945.9; t-statistics: −0.229 and 9.750)
  Adjusted R² = 0.9216; F-statistic = 95.06
Regression: Interpretation
• Operational_loss on day i = [intercept] + [slope × system downtime on day i] + [random error on day i]
• Simple linear regression is a conditional expectation.
• The slope parameter "β" measures the relationship between X (the independent variable) and Y (the dependent variable). It is interpreted as the expected change in Y for a 1-unit change in X.
• For example, if we estimate a regression and find E(Y/X) = 2.75 + 0.35X, a 1-unit change in X is expected to lead to a 0.35-unit change in Y.

Statistical Significance Test of Regression Coefficient

• In the op-risk regression example, one can check: does system downtime affect the amount of operational losses?
• We may, if we wish, test this formally with H0: β = 0 against H1: β > 0.
• The S.E. and t-value, with the resulting p-value, will help us test the null hypothesis (a one-tailed test).
Regression Interpretation: Ex2
• JPMC estimated the following regression equation using 15 years of data points:
• Loss Given Default Rate = 1.16 + 0.16 × ln(Default Rate)
  (0.0001) (0.0069)
  R² = 0.44 and adjusted R² = 0.40

Source: M. Araten, M. Jacobs Jr. and P. Varshney (May 2004), "Measuring LGD on Commercial Loans: An 18-Year Internal Study", RMA.

Multivariate Regression Analysis


„ Regression with two or more X variables
„ All variables have to be continuous (at least interval level), X
„ variables can also be dichotomous dummy-variables
„ How does a group of X variables affect Y variable?
„ Regression equation:

In logistic regression the dependent variable is dichotomous


(e.g. yes/no) 112
Multivariate Regression

  ⎡Y1⎤   ⎡1  X21 …  Xk1⎤ ⎡β1⎤   ⎡u1⎤
  ⎢Y2⎥ = ⎢1  X22 …  Xk2⎥ ⎢ ⋮ ⎥ + ⎢u2⎥
  ⎣Yn⎦   ⎣1  X2n …  Xkn⎦ ⎣βk⎦   ⎣un⎦

  y = Xβ + u
  β̂ = (X′X)⁻¹X′y

Regression Results
• Regression analysis produces the following results:
• For the whole regression:
  • R² gives the explanatory power of the regression model (explained variance/total variance)
  • ANOVA (F-test and p-values: a test of overall goodness of fit)
• For each X variable:
  • Regression coefficients (betas)
  • Standard error of the regressor
  • t-test value for statistical significance (with p-values)
R2, Adjusted R2, F statistics for Model Fit

• R² = ESS/TSS measures the explanatory power of the regression model.
• Adjusted R² = 1 − [(RSS/(n − k)) / (TSS/(n − 1))]
• n = no. of observations; k = no. of parameters.
• The F-test examines the goodness of fit of the model.
• H0: β1 = β2 = β3 = … = βn = 0
• F = [R²/(k − 1)] / [(1 − R²)/(n − k)]
• A high F value with a low p (less than 5%) rejects the null hypothesis, and the model fits well.

The Difference between Correlation and Regression

• Correlation analysis measures how two random variables vary together.
• In regression we assume the values taken by the dependent variable are influenced or caused by the independent variables.
• Therefore, regression provides us with a cause-and-effect modeling framework.
• Correlation, on the other hand, informs us that two variables may be related, but it tells us nothing about causation.
Example of Regression Analysis

[Regression output table for the experienced-health example interpreted on the next slide.]

Coefficients

Interpretation:
• Health does not seem to be dependent on sex (P = 0.209 > 0.05).
• Age, smoking and exercising have significant effects on health.
• Age has the strongest effect (Beta = −0.316). The older the person, the weaker the experienced health. Smoking has a negative effect and exercising a positive effect on health.
• In total the model is statistically significant and explains 15.5% of the total variation in experienced health.
Application of Multiple Regression: Ex1
„ Operational Loss = f(system downtime, no. of trainees working, no. of
experienced staff, volume of transactions, no. of transaction errors)
Dependent Variable: OPLOSS
Method: Least Squares
Date: 11/15/09 Time: 00:18
Sample: 1 10
Included observations: 9
Variable Coefficient Std. Error t-Statistic Prob.

C 508467.1 359893.5 1.412827 0.25260


SYSTDOWN 162073.9 11343.19 14.28822 0.00070
TRAINEES 42063.33 19657.59 2.139801 0.12190
EXPRSTAFF -41034.77 9035.252 -4.541629 0.02000
TRANSNO 0.556896 4.598436 0.121106 0.91130
TRANSERR -1.074378 40.7226 -0.026383 0.98060

R-squared 0.99457 Mean dependent var 1289609


Adjusted R-squared 0.985521 S.D. dependent var 1203115
S.E. of regression 144767.8 Akaike info criterion 26.83837
Sum squared resid 6.29E+10 Schwarz criterion 26.96985
Log likelihood -114.7727 F-statistic 109.9071
Durbin-Watson stat 1.940854 Prob(F-statistic) 0.001352
119

Logistic Regression
„ Logistic regression in a nutshell:
„ It is a multiple regression with an outcome variable (or
dependent variable) that is a categorical dichotomy and
explanatory variables that can be either continuous or
categorical
„ In other words, the interest is in predicting which of two
possible events is going to happen given certain other
information
„ For example in Political Science, logistic regression could be
used to analyse the factors that determine whether an
individual participates in a general election or not.

120
Why can't we use a Simple Linear Regression?

„ Let us remember what we have learnt about Simple Linear


Regression:
„ We used it when we had reasons (a theory) to assume

causality between two variables: X → Y.


„ Example:

„ X= Investment in R&D; Y= New Products introduced

„ In particular, we want the ‘X’ to cause the ‘Y’ and not the
reverse.

121

Simple Linear Regression


„ This sort of regression analysis provides us with useful
information:
„ E.g.: For a certain confidence level (95%, for example): How

much the explained variable (Y) changes as a result of a


change in the explanatory variable (X)
„ With a regression we can predict the value of Y given the

value of X

122
Simple Linear Regression
How is this impact of X on Y estimated?

„ We assumed a linear relation between the two variables


„ We introduced ‘u’, unobserved factors affecting Y, which we
are not going to account for in our model
„ Then we postulated the following relation:
Yi = α + βXi + ui

123

Simple Linear Regression


How is this impact of X on Y estimated?

„ We made some assumptions about ‘u’


(Basically we assumed that ui are identically and independently
distributed with zero mean and constant variance)

„ Then we estimated the parameters for the model (using


generally Ordinary Least Squares)

„ Simple Linear Regression provides the ‘best fit’ line. i.e.: the
straight line which best describes the relationship between the
two variables

124
Our example: R&D and New Products

„ How does investment in R&D affect the number of new products
developed? We can postulate the following relation:
# of new products = α + β × Investment in R&D + u
„ Let us look at the scatter plot:
[Scatter plot of NEWPROD (0-50) against RD (0-800)]

125

Our Example: Investment in R&D and introduction of new products

„ It makes sense to assume a linear relation between X and Y in this case.
„ The estimate for β = 0.049
„ This tells us that in order to increase the number of new products by one
unit, we need to invest a little more than 20 monetary units in R&D.
„ If a company invests 1000 in R&D, we would predict this company to
develop around 49 new products
[Scatter plot of NEWPROD against RD with the fitted regression line]

126
Another example: Failing or Passing an
exam

„ Let us define a variable ‘Outcome’


„ Outcome = 0 if the individual fails the exam

= 1 if the individual passes the exam


„ We can reasonably assume that Failing or Passing an exam
depends on the quantity of hours we use to study
„ Note that in this case, the dependent variable takes only two
possible values. We will call it a ‘dichotomous’ variable

127

Regression analysis with dichotomous dependent variables

„ We will be interested then in inference about the probability of


passing the exam.

„ Were we to use linear regression, we would postulate:


Prob (Outcome=1) = α + β*Quantity of hours of study + u

„ As we are concerned about modelling the probability of the event


occurring, this is a probability model

„ As we model the relation between the quantity of hours of study


and the probability of passing the exam as linear, this is a linear
model

„ We will call this model a ‘Linear Probability Model’ (LPM)


128
Linear Probability Models (LPM)

„ Our dataset contains information about 14 students.
„ Our statistical software (SPSS) will happily perform a linear
regression of Outcome on the quantity of study hours.

Student id   Outcome   Quantity of Study Hours
1            0         3
2            1         34
3            0         17
4            0         6
5            0         12
6            1         15
7            1         26
8            1         29
9            0         14
10           1         58
11           0         2
12           1         31
13           1         26
14           0         11
129

Linear Probability Models (LPM) – What is wrong with them?

„ Let us do a scatter plot and insert the regression line:
„ The probability of Outcome=1 can take values between 0 and 1
„ But we do not observe probabilities but the actual event happening
„ A straight line will predict values between negative and positive
infinity, outside the [0,1] interval!
[Scatter plot of OUTCOME against HSTUDY with the fitted line]

130
What is wrong with LPM?
Coefficients
Model           B           Std. Error   Sig.
1 (Constant)    -0.031861   0.161591     0.846994
  HSTUDY        0.026219    0.006483     0.001627
a. Dependent Variable: OUTCOME

„ Above is the SPSS output on the linear regression of ‘Outcome’


on Hours of Study
„ The results suggest that an increase in 1 hour of studying
increases the probability of passing the exam, on average, by
approx. 0.026 or 2.6%.
„ So what would the model predict if we studied 100 hours for
the exam?

131
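„ A quick check of that question with the estimated coefficients (a sketch; the point is that the fitted “probability” leaves the [0,1] interval):

# LPM fitted above: P(pass) = -0.031861 + 0.026219 * hours of study
b0, b1 = -0.031861, 0.026219

for hours in (10, 40, 100):
    p_hat = b0 + b1 * hours
    print(hours, round(p_hat, 3))
# At 100 hours the predicted "probability" is about 2.59 -- far above 1!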

Linear Probability Models (LPM) –


What is wrong with them?
„ Basically, the linear relation we had postulated before between
X and Y is not appropriate when our dependent variable is
dichotomic. Predictions for the probability of the event
occurring would lie outside the [0,1] interval, which is
unacceptable.
„ Two other subtle problems with the LPM:
„ Distribution of ui is not normal as we wished it to be

„ The variance of ui is not constant (problem of

heteroscedasticity)

132
Non Linear Probability Models

„ We want to be able to model the probability of the event


occurring with an explanatory variable ‘X’, but we want the
predicted probability to remain within the [0,1] bounds.
„ There is a threshold above which the probability hardly
increases as a reaction to changes in the explanatory
variable.

„ Many functions meet these requirements (non-linearity and


being bounded within the [0,1] interval).

„ The logistic function is one such function.

„ The Logistic Curve will relate the explanatory variable X to the


probability of the event occurring. In our example, it will relate
the number of study hours with the probability of passing the
exam.
133

Logistic Regression
„ Logistic regression, and related methods such as Probit analysis,
are very useful techniques when one wants to understand or to
predict the effect of a series of variables on a binary response
variable (a variable which can take only two values, 0/1 or
Yes/no, for example).
„ The methodology of logistic regression aims at modeling the
probability of success depending on the values of the
explanatory variables, which can be categorical or numerical
variables.
„ For example, a marketing researcher may want to detect if
customers are likely to renew their savings deposit/Loan Facility
„ Logistic regression can be helpful to model the effect of repeal
of a patent on profitability of textile firms or to examine the key
determinants of likelihood of a firm to export or to evaluate the
risk for a bank that a client will not pay back a loan
The Logit Model
„ A Logit Model states that:
„ Prob(Yi=1) = F (α + βXi)

„ Prob(Yi=0) = 1 - F (α + βXi)

„ Where F(.) is the ‘Logistic Function’.


„ So, the probability of the event occurring is a logistic
function of the independent variables

135

Logit Models

P(Yi = 1) = F(b0 + b1X1i) = 1 / (1 + e^-(b0 + b1X1i))

„ When there is only one explanatory variable (X1), the Logistic


Function is defined as above.
„ Therefore, we will be interested in finding estimates for b0 and
b1 so that the Logistic Function best fits the data

136
How do we find the best Logistic Function
to fit our data?

„ We will estimate our model with Maximum Likelihood
techniques, which are implemented in standard statistical packages.
„ Logistic regression can be run in statistical packages like
STATA, SPSS etc.
„ One can also use XLSTAT to run a logistic regression in Excel:
„ http://www.kovcomp.co.uk/support/XL-Tut/demo-log.html
„ http://www.kovcomp.co.uk/xlstat/index.html
„ One may also use Solver in Excel to run a logistic regression
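„ As a sketch, the same estimation can be done in Python: the code below fits the 14-student dataset shown earlier by maximum likelihood, assuming the statsmodels package is available:

import numpy as np
import statsmodels.api as sm

hours = np.array([3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11], dtype=float)
outcome = np.array([0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0])

X = sm.add_constant(hours)              # adds the intercept column
result = sm.Logit(outcome, X).fit()     # maximum likelihood estimation
print(result.params)                    # estimates of b0 and b1
print(result.predict(X))                # fitted probabilities, all inside (0, 1)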

Logit Regression in Credit Risk


Management
„ The logit regression technique is popularly used in developing credit
scoring models for Corporate/SME/Retail loans
„ The logit model overcomes the “unboundedness” weakness of MDA and
Linear Probability Models by restricting the estimated range of default
probabilities to lie between 0 and 1.
„ Essentially this is done by plugging the estimated value of Z from the
LPM into the formula: F(z) = 1/(1 + exp(z))
„ where z is the score from a regression equation in which two factors, the
debt-equity ratio (D/E) and the sales-asset ratio (S/A), predict the
probability of default of payment on a loan.
„ The estimated LPM regression equation from past data on the default
behavior of borrowers is: Z = 0.5(D/E) - 0.3(S/A)
„ Assume a prospective borrower has D/E = 0.3 and S/A = 4. Its
expected LPM score = 0.5×0.3 - 0.3×4 = -1.05, and F(z) =
1/(1 + exp(-1.05)) = 74.08%. That is, the estimated probability of
solvency is 74.08% (so the probability of default is 25.92%).
138
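„ The arithmetic of this example is a two-line check in Python; a sketch:

from math import exp

D_E, S_A = 0.3, 4.0
z = 0.5 * D_E - 0.3 * S_A        # LPM score: -1.05
p_solvency = 1 / (1 + exp(z))    # F(z) as defined above
print(z, round(p_solvency, 4))   # -1.05, 0.7408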
Logit Model in Credit Risk
„ Logistic regression is a simple and appropriate technique for
estimating the log of the odds of default as a linear function of
loan application attributes:
ln[Prob(Default)/Prob(Solvency)] = β0 + β1X1 + β2X2 + β3X3 + … + βkXk

„ A logistic model has the flexibility of incorporating both the


qualitative and quantitative factors and is more efficient than
the linear regression probability model.
„ In a logistic regression exercise, we are actually predicting the
probability of a loan default based on financial, non-financial
(qualitative borrower characteristics) and situational factors
(location and local factors) obtained from the credit files of the
Bank, plus external macroeconomic conditions

139

Logistic Regression in Operational Risk


Management
„ Logistic regression is a useful tool for analyzing data that
includes binary dependent variables such as presence or
absence of a fraud and success or failure of a back office
process or system.
„ Logistic regression is simply a nonlinear transformation of the
linear regression model. However, unlike OLS, it does not
require assumptions about normality.
„ The dependent variable is the log odds ratio, or logit
„ For example, observed daily “computer system failures” (coded
yes/no as 1/0) are converted to proportions, which are then fitted
by a logit model that determines the probability that the computer
system will fail today, given causative factors such as the presence
of new trainees, the staff ratio, business volume etc.

140
Multiple Discriminant Analysis
„ Discriminant analysis is appropriate in situations where the
researcher may want to identify those variables/factors which
are effective in predicting group membership or what variables
discriminate well between groups.

141

MDA Analysis: Altman’s Z Score

¾ Altman (1968), for the first time, applied Multiple Discriminant Analysis
(MDA) in response to shortcomings of traditional univariate financial ratio
analysis.
MDA models are developed in the following steps:
¾ Establish a sample of two mutually exclusive groups: firms which have
“failed” and those which are still continuing to trade successfully
¾ Collect financial ratios for each of these companies belonging to both of
these groups
¾ Identify financial ratios which best discriminate between groups (F-test/
Wilk’s Lambda test).
¾ Establish a Z score based on these ratios.
142
Altman’s Z-Score Model

Altman Z model predicts the probability of a company going bankrupt


within a period of 12 months:

Z = 1.2X1 + 1.4X2 + 3.3X3 + 0.6X4 + 0.999X5


Z = The Z score
X1 = Net Working Capital (NWC)/Total Asset (liquidity)
X2 = Retained Earnings/Total Asset (cumulative profitability)
X3 = Profit before Interest and Tax (PBIT)/total assets (productivity)
X4 = Market Value of Equity/Book value of Liabilities (movement in
the asset value)
X5 = Sales/Total Assets (sales generating ability)

143

Altman’s Z-Score Model

¾ It is a classificatory model for corporate customers

Î Z-score > 2.99 - firm is in a good shape

Î 2.99 > Z-score > 1.81 - warning signal
Î Z-score < 1.81 - big trouble; the firm could be heading towards
bankruptcy
Î Therefore, the greater a firm’s distress potential, the lower
its discriminant score

¾ Z-score model can be used as a default probability predictor


model

144
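„ A minimal sketch of the Z-score computation and the classification zones above (the ratio values are made up for illustration):

def altman_z(x1, x2, x3, x4, x5):
    """Altman (1968) Z-score from the five financial ratios."""
    return 1.2*x1 + 1.4*x2 + 3.3*x3 + 0.6*x4 + 0.999*x5

def zone(z):
    if z > 2.99:
        return "good shape"
    if z > 1.81:
        return "warning signal"
    return "big trouble - possible bankruptcy"

# Made-up ratios for one firm
z = altman_z(x1=0.12, x2=0.18, x3=0.09, x4=0.80, x5=1.4)
print(round(z, 2), zone(z))   # 2.57 -> warning signal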
The Z score and weights
„ The discriminant coefficients can be estimated using following formula
based on 2 variables:
„ Z = aX + bY where X = TOL/TA and Y = CR

„ where
„ a = [VarY·(avg.Xsolv - avg.Xdef) - CovXY·(avg.Ysolv - avg.Ydef)] / [VarX·VarY - (CovXY)²]
„ b = [VarX·(avg.Ysolv - avg.Ydef) - CovXY·(avg.Xsolv - avg.Xdef)] / [VarX·VarY - (CovXY)²]
„ where CovXY = Σ(X - avg.X)(Y - avg.Y)/(n - 1)
„ avg. Xsolv=mean of variable X for borrowers in solvent category
„ avg. Xdef=mean of variable X for borrowers in defaulted group
„ avg. Ysolv=mean of variable Y for borrowers in solvent category
„ avg. Ydef=mean of variable Y for borrowers in defaulted category
„ The cut off Z-score is the combined benchmark for identified
independent variables to classify the prospective borrower into defaulted
or solvent category.

145
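„ A numpy sketch of these formulas under one reading of them (made-up samples of X = TOL/TA and Y = CR for each group; the variance and covariance terms are computed over the pooled sample):

import numpy as np

# Made-up samples; columns are (X = TOL/TA, Y = CR)
solvent   = np.array([[1.8, 1.9], [2.1, 1.7], [1.5, 2.2], [2.0, 1.8]])
defaulted = np.array([[3.5, 1.1], [4.0, 0.9], [3.2, 1.2], [3.8, 1.0]])

pooled = np.vstack([solvent, defaulted])
var_x = pooled[:, 0].var(ddof=1)
var_y = pooled[:, 1].var(ddof=1)
cov_xy = np.cov(pooled[:, 0], pooled[:, 1], ddof=1)[0, 1]

dx = solvent[:, 0].mean() - defaulted[:, 0].mean()   # avg.Xsolv - avg.Xdef
dy = solvent[:, 1].mean() - defaulted[:, 1].mean()   # avg.Ysolv - avg.Ydef

den = var_x * var_y - cov_xy**2
a = (var_y * dx - cov_xy * dy) / den
b = (var_x * dy - cov_xy * dx) / den
print(a, b)   # discriminant weights in Z = aX + bY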

Regime Switching Regression-Dummy


Variable Regression
„ When an operational risk event is subject to regime shifts, the
parameters of the statistical model will be time-varying.
„ For example, consider the time series of minutes of system
downtime per month for a particular business unit. It may
happen that downtime fell sharply as a result of the business
unit outsourcing its IT administration. Thus, the change in
management policy had a direct impact on the stochastic
behavior of the OR event “system downtime”.
„ One simple approach for capturing regime shift would be to
model regression using either intercept or slope dummies:
Before and After IT function has been outsourced.

146
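„ A sketch of the intercept-dummy version (made-up monthly downtime series; statsmodels assumed available):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
# 36 months of downtime; IT administration outsourced after month 24
downtime = np.concatenate([rng.normal(400, 40, size=24),
                           rng.normal(150, 40, size=12)])
outsourced = np.concatenate([np.zeros(24), np.ones(12)])  # regime dummy

X = sm.add_constant(outsourced)
result = sm.OLS(downtime, X).fit()
# Constant ~ pre-outsourcing mean; dummy coefficient ~ shift in the mean
print(result.params)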
Multivariate Regression Interpretation
(Including Regime Shifting): Ex2

Repo Rate Determinants (using quarterly data from Nov. 2000 to Dec. 2007)

                          Jt. Jalan & Reddy Period   Reddy Period
                          (Nov. 2000-07)             (Sept 2003-Dec 2007)
                          All          Only money    All          Only money
                          determinants supply        determinants supply
Lagged % change GDP       -0.22***     ---           -0.027       ---
Lagged % change WPI       0.013        ---           -0.01        ---
Lagged % change M3        0.25**       0.24***       0.24***      0.24***
Constant                  4.9***       3.67***       3.22*        3.00***
Adjusted R-squared        0.47         0.22          0.84         0.84
Observations              29           29            18           18
Note: Absence of * indicates not significant even at the 5% level of confidence; * indicates
significance at 5%, ** at 1% and *** at 0.1%

Source: “Triplets: RBI, money supply and the repo”,


by Surjit Bhalla, Business Standard, Jan 6, 2009

147

Linear Probability or Multiple Discriminant


Analysis (MDA) for Op-Risk Scores & Rating
Migration Analysis
„ MDA, a statistical classification technique popularly applied in credit
risk management, can also be applied to design a Qualitative Risk
Rating of the Operational Risk Management Process in a Bank (as part
of QLA).
„ To apply this technique, one has to devise a risk map that
encompasses all types of operational risk (legal, compliance,
operations, security, system, etc.) and then the key qualitative as
well as quantitative factors may be selected through MDA statistical
analysis to develop risk level of operational process of a Bank (e.g.
transaction processing in each business). Such exercise would also
help us to understand the key factors that affect the risk rating.
„ The awarded score will determine the OR Rating (on a 7- or 8-point scale).
„ Next, one can study the rating migration of credit/operational risk
over time (following the Markov chain method), which would help
top management to control or monitor risk and to identify the
factors responsible for downgrades etc.
148
Modeling Computer System Failure Risk
with Logistic Regression
„ Processing risk covers losses from back office operations. It
includes, among other factors, the failure of computer systems.
„ Logistic regression allows us to calculate the probability associated
with such a failure, given assumed risk indicators.
„ For example, given the data, we hypothesize that the probability of
computer failure is related to the ratio of available staff (systems
support and maintenance) to all available staff on a particular day,
and to the volume of computer-related business activity as a
proportion of the recommended maximum capacity of the system.
„ ln[p/(1-p)] = α + β1(staff ratio_i) + β2(volume_i) + ε_i
„ After estimating the logistic regression equation using the daily
data, we obtain the following estimated coefficients (p-values in
brackets):
„ ln[p/(1-p)]=-1.9936 (0.001)+1.416(0.002)×Staff_ratio
+1.3917(0.003)×Volume
149
Note: McFadden’s Pseudo R2 measures the goodness of fit
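„ Given assumed values of the two risk indicators, the failure probability follows by inverting the logit; a sketch using the coefficients estimated above (the indicator values are illustrative assumptions):

from math import exp

b0, b1, b2 = -1.9936, 1.416, 1.3917   # estimated coefficients from above

def p_failure(staff_ratio, volume):
    """Probability of a computer system failure from the fitted logit."""
    z = b0 + b1 * staff_ratio + b2 * volume
    return 1 / (1 + exp(-z))

# Illustrative (assumed) indicator values for one day
print(round(p_failure(staff_ratio=0.10, volume=0.90), 3))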

Diagnostic Checks-validation of regression


estimation
„ Heteroscedasticity: the data violate the assumption that the
disturbance terms all have the same variance.
„ It affects the standard errors of the regression coefficients, leading
to incorrect decisions concerning the reliability of the partial
regression coefficients.
„ Multicollinearity: linear dependence between explanatory variables;
it affects the precision of the estimators (low significance of
individual variables but high R2).
„ If your goal is simply to predict Y from a set of X variables, then
multicollinearity is not a problem. The predictions will still be
accurate, and the overall R2 (or adjusted R2) quantifies how well
the model predicts the Y values.
„ If your goal is to understand how the various X variables impact
Y, then multicollinearity is a big problem. One problem is that the
individual P values can be misleading (a P value can be high, even
though the variable is important).
„ Solutions: The best solution is to understand the cause of
multicollinearity and remove it. You can also reduce its impact by
increasing the sample size.
150
Diagnostic Checks…
„ Serial Correlation: when error terms are serially correlated (it can be
identified by a Durbin-Watson statistic well below 2)
„ Serial correlation will not affect the unbiasedness or consistency of
the OLS estimators, but it does affect their efficiency (standard
errors may be understated under positive serial correlation), which
leads to a tendency to wrongly reject the null hypothesis when it
should not be rejected.

151
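„ Both checks are readily automated; a sketch on made-up data using statsmodels’ variance_inflation_factor (multicollinearity) and durbin_watson (serial correlation) helpers:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
y = 1 + x1 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))
result = sm.OLS(y, X).fit()

# VIF well above 10 is a common warning sign of serious multicollinearity
print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
# Durbin-Watson near 2 suggests no serial correlation in the residuals
print(durbin_watson(result.resid))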

Time Series Techniques of Forecasting


„ A time series is a set of observations on a variable measured at
successive points of time. E.g. interest rate movements, price
movements, GDP trends, yield on bonds/spread movement, credit
growth etc.
„ In order to forecast the time series variables more accurately, one has
to devise a scheme to describe the movement in a time series
adequately.
„ For this purpose, we need to decompose the time series into four
components:
„ Trend component - long term smooth movement.
„ Cyclical component - oscillatory movement around the trend (ups
and downs). One has to find the lag structure to capture cyclicality.
„ Seasonal component - also oscillatory in character but strictly
confined to intra-year movement. This mainly appears in
daily/weekly/monthly data series.
„ Irregular component - the residual, unsystematic (random) movement.

152
The following data represent the closing value of the Dow Jones
Industrial Average for the years 1980-2001.
[Data table not reproduced]
153

Time Series Plot
[Time series plot of the Dow Jones closing values]
154
Monthly WPI Series
[Line chart of the monthly WPI index, 1998-2007]
155

Time Series of Monthly Aaa Bond Yields
[Line chart of monthly Aaa bond yields (BOND_YLD), 1990-1994]
156


Trend Analysis
„ The trend component of a time series reflects the long run
movement. Usually, it is a rising or falling smooth curve.
„ It is very advantageous for forecasting population growth, credit
growth trend etc. if we can represent the trend by a simple
mathematical function.
„ The linear trend is a special case of polynomial trend. An nth degree
polynomial trend can be written as:
Yt = a0 + a1·t + a2·t² + a3·t³ + … + an·tⁿ
„ One can perform OLS regression to get trend fitting.
„ Suppose one has a yearly data series of a bank’s credit supply from
1960 to 2006 and wants to fit a trend to the actual series. For this, one
converts the actual credit series into natural logs (base e), fits an nth
degree polynomial trend function like the above equation, and
estimates the coefficients by the Ordinary Least Squares (OLS)
method. This gives the compound annual growth rate and can also be
used to project the trend for future years.
157
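„ A sketch of such a log-linear trend fit (made-up yearly credit series; np.polyfit performs the OLS fit):

import numpy as np

rng = np.random.default_rng(3)
t = np.arange(47)                                   # years 1960..2006
credit = 100 * 1.12**t * np.exp(rng.normal(scale=0.05, size=t.size))

# Fit ln(credit) = a0 + a1*t by OLS (polyfit returns the slope first for deg=1)
a1, a0 = np.polyfit(t, np.log(credit), deg=1)
print(np.exp(a1) - 1)                               # implied compound growth rate (~0.12)

# Project the fitted trend five years beyond the sample
t_future = np.arange(t[-1] + 1, t[-1] + 6)
print(np.exp(a0 + a1 * t_future))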

ARIMA Technique
„ Any time series which contains no trend can be represented as consisting of two
parts: an AR process (dependence on lags of the variable itself) and an MA process
(dependence on lagged errors, i.e. serial correlation in the disturbance)
„ ARIMA model can improve forecasting power as it incorporates trend, cyclicality
and seasonality
„ STEPS in Building ARIMA model of forecasting:
„ A. Model Identification - stationarity check, identifying the level of stationarity
of the series (or order of integration) and specifying the AR and MA processes
Methods-
„ Correlogram analysis-studying the Auto correlation function (ACF) and partial
auto correlation function (PACF) lag structure
„ Dickey-Fuller Unit Root Test
„ B. Model Estimation: Having determined the orders of the ARIMA model, the
model can be estimated in either EVIEWS 5 or STATA 9 using differenced
regression technique.
„ C. Diagnostic Checks: Once the ARIMA model is specified and its parameters are
estimated, the adequacy of the model may be checked through the Box-Pierce-
Ljung residual test (a white noise test of the residuals)
„ D. Forecasting: After diagnostic checks, the regression equation may be used to
generate short term (static) or long term (dynamic) forecasts
158
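„ The four steps map onto standard Python tools as well; a compact sketch on a simulated series (adfuller, ARIMA and acorr_ljungbox are statsmodels functions; the (1,1,1) order is just an assumption for illustration):

import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(4)
series = np.cumsum(rng.normal(size=200))   # simulated I(1) series

# A. Identification: unit root test (large p-value -> difference the series)
adf_stat, p_value = adfuller(series)[:2]
print(p_value)

# B. Estimation: ARIMA(1,1,1); d=1 applies the differencing internally
result = ARIMA(series, order=(1, 1, 1)).fit()

# C. Diagnostics: Ljung-Box white-noise test on the residuals
print(acorr_ljungbox(result.resid, lags=[10]))

# D. Forecasting: short-term forecasts from the fitted model
print(result.forecast(steps=4))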
Graphical Presentation of ARIMA Process

AR (1) process: [ACF/PACF correlogram not reproduced]

MA (1) process: [ACF/PACF correlogram not reproduced]

159

Correlogram Study…
AR (2) process: [ACF/PACF correlogram not reproduced]

MA (2) process: [ACF/PACF correlogram not reproduced]

Note that the ACF of an explosive or random walk series does not
die out quickly
160
Correlogram Study

AR (1) MA (1) process:

Note: If, after first differencing, the ACF/PACF spikes in the
correlogram are eliminated, the series is I(1); if this happens only
after second differencing, it is an I(2) process
161

Useful References
„ Greene, W. H. (2007). “Econometric Analysis”, Fifth Edition, Low Price Edition,
Pearson Education.
„ Gujarati, D N (2004): “Basic Econometrics”, 4th Edition, Tata McGraw-Hill.
„ Johnston J., and DiNardo J (1997): “Econometric Methods”, 4th Edition, The
McGraw-Hill Companies, Inc. (Important for time series and panel data analysis).
„ Lewis, N. D. C. “Operational Risk: Applied Statistical Methods for Risk
Management”, Wiley Finance.
„ Maddala, G S (1983): “Limited-Dependent and Qualitative Variables in
Econometrics”, Cambridge University Press.
„ Pindyck, R.S., and D. L. Rubinfeld (1981), “Econometric Models and Economic
Forecasts”, McGraw-Hill International Editions.
„ Vose, D. “A Guide to Monte Carlo Simulation Modeling”, John Wiley & Sons.
„ Walpole, R. E. (1982) “Introduction to Statistics”, Publisher: The Macmillan Co., NY.
„ Walpole, R. E., Sharon L Myers, Keying Ye, Raymond H. Myers (2006), “Probability
and Statistics”.
„ EVIEWS help, STATA help, SPSS 17 help etc.
„ @Risk and BestFit Software at Palisade: www.palisade.com.au

162
Thank You

My Email: arindam@nibmindia.org

163
