
South Africa heart disease project

Omar M. Osama

Deyaa Eldeen A. Almahallawi

June 16, 2010

This report presents a study of several classification methods applied
to the South Africa heart disease problem. The neural network method is
found to perform best, with an error in the range between 15.5% and
16.5%.

1 Introduction
This report presents three different classifiers that estimate the response of
the South Africa heart disease problem. In this problem there are nine
features: systolic blood pressure, cumulative tobacco, low-density lipoprotein
cholesterol, adiposity, family history of heart disease (Present, Absent),
type-A behavior, obesity, current alcohol consumption, and age at onset.
The response is coronary heart disease. The classifiers used are Linear
Discriminant Analysis, Logistic Regression, and a Neural Network.

2 Linear Discriminant Analysis performance

Before discussing LDA we need to introduce the log-likelihood ratio.

2.1 Log-likelihood ratio

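$$\rho(C_2) \underset{C_2}{\overset{C_1}{\gtrless}} \rho(C_1)$$

Let's explain this equation.

$\rho(C_2)$: the expected loss of choosing $C_2$ as a prediction.
$\rho(C_1)$: the expected loss of choosing $C_1$ as a prediction.

So if $\rho(C_2) > \rho(C_1)$ we assign to $C_1$,
and if $\rho(C_2) < \rho(C_1)$ we assign to $C_2$, which makes a lot of sense: we pick the class with the smaller loss. Here

$$\rho(C_2) = L_{12} P(C_1 | X = x) + L_{22} P(C_2 | X = x)$$
$$\rho(C_1) = L_{11} P(C_1 | X = x) + L_{21} P(C_2 | X = x)$$

$L_{12}$: the loss when you assign to $C_2$ but the true class is $C_1$.
$L_{21}$: the loss when you assign to $C_1$ but the true class is $C_2$.
$L_{11}$ and $L_{22}$: the losses of correct decisions.

Substituting into the rule gives

$$L_{12} P(C_1 | X = x) + L_{22} P(C_2 | X = x) \underset{C_2}{\overset{C_1}{\gtrless}} L_{11} P(C_1 | X = x) + L_{21} P(C_2 | X = x) \qquad [L_{ii} = 0 \text{ usually}]$$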

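After calculating (with $L_{11} = L_{22} = 0$):

$$\frac{P(C_1 | X = x)}{P(C_2 | X = x)} \underset{C_2}{\overset{C_1}{\gtrless}} \frac{L_{21}}{L_{12}}$$

Applying Bayes' theorem we find that:

$$\frac{P(X | C_1)\,\pi_1 / P(X)}{P(X | C_2)\,\pi_2 / P(X)} \underset{C_2}{\overset{C_1}{\gtrless}} \frac{L_{21}}{L_{12}}$$

$$\frac{P(X | C_1)}{P(X | C_2)} \underset{C_2}{\overset{C_1}{\gtrless}} \frac{\pi_2 L_{21}}{\pi_1 L_{12}}$$

$$\frac{f_1(X)}{f_2(X)} \underset{C_2}{\overset{C_1}{\gtrless}} \frac{\pi_2 L_{21}}{\pi_1 L_{12}}$$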

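Taking ln of both sides:

$$\ln \frac{f_1(X)}{f_2(X)} \underset{C_2}{\overset{C_1}{\gtrless}} th$$

$$h(X) \underset{C_2}{\overset{C_1}{\gtrless}} th$$

where $h(X) = \ln \frac{f_1(X)}{f_2(X)}$ and $th = \ln \frac{\pi_2 L_{21}}{\pi_1 L_{12}}$.
If $h(X) > th$ we assign to $C_1$, whereas if $h(X) < th$ we assign to $C_2$. This
makes a lot of sense: with equal losses, we assign to the class with the greater posterior probability.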
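
The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `assign_class`, its parameter names, and the numeric densities are hypothetical, and the threshold is taken to be $th = \ln \frac{\pi_2 L_{21}}{\pi_1 L_{12}}$, as obtained by taking the logarithm of the ratio test.

```python
import math

def assign_class(f1, f2, prior1, prior2, loss21=1.0, loss12=1.0):
    """Return 1 or 2 by comparing h(X) = ln(f1/f2) with the threshold th.

    loss21: loss when we assign to C1 but the true class is C2.
    loss12: loss when we assign to C2 but the true class is C1.
    """
    h = math.log(f1 / f2)
    th = math.log((prior2 * loss21) / (prior1 * loss12))
    return 1 if h > th else 2

# Equal priors and equal losses: th = 0, so the larger density wins.
label_equal = assign_class(f1=0.30, f2=0.20, prior1=0.5, prior2=0.5)

# Making a wrong assignment to C1 five times as costly raises th
# and moves the same observation to C2.
label_costly = assign_class(f1=0.30, f2=0.20, prior1=0.5, prior2=0.5, loss21=5.0)
```

The second call shows why the losses matter: the densities are unchanged, yet the decision flips because the threshold moves from 0 to ln 5.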
Now let's begin with LDA.

2.2 How does it work?

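First we estimate $\mu_1$ and $\mu_2$ from the observations of $C_1$ and $C_2$ respectively.
Then we estimate a single $\Sigma$ for both classes, assuming that both come from
distributions with the same covariance (unlike QDA). Then we find $h(X)$ from the equation

$$h(X) = X' \Sigma^{-1} (\mu_1^T - \mu_2^T) + \frac{1}{2}\left(\mu_2 \Sigma^{-1} \mu_2^T - \mu_1 \Sigma^{-1} \mu_1^T\right) \qquad (2)$$

where $X$ is a column vector and $\mu_1$, $\mu_2$ are row vectors; this is $\ln \frac{f_1(X)}{f_2(X)}$ for two Gaussian densities sharing $\Sigma$.
Using equation (1), we can predict which class each observation belongs to.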
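
A minimal NumPy sketch of these steps follows, assuming Gaussian classes with a shared covariance, equal priors and equal losses (so $th = 0$). The data are synthetic and purely illustrative, not the heart-disease data, and all variable names are hypothetical. Means are stored as 1-D arrays, so the transposes in equation (2) are implicit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data (illustrative only).
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2))    # class C1
X2 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(200, 2))  # class C2

# Step 1: class means.
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Step 2: one pooled covariance for both classes (the LDA assumption).
n1, n2 = len(X1), len(X2)
S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n1 + n2 - 2)
S_inv = np.linalg.inv(S)

def h(X):
    """Log density ratio ln(f1/f2) for Gaussians with a shared covariance."""
    linear = X @ S_inv @ (mu1 - mu2)
    constant = 0.5 * (mu2 @ S_inv @ mu2 - mu1 @ S_inv @ mu1)
    return linear + constant

# Step 3: compare h(X) with the threshold.
th = 0.0  # assumption: equal priors and equal losses
pred1 = np.where(h(X1) > th, 1, 2)
pred2 = np.where(h(X2) > th, 1, 2)
error_rate = (np.sum(pred1 != 1) + np.sum(pred2 != 2)) / (n1 + n2)
```

Because $h$ is linear in $X$, the boundary $h(X) = th$ is a hyperplane; QDA would instead estimate one covariance per class and give a quadratic boundary.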

3 Logistic Regression performance

Despite the word regression in its name, logistic regression is a classification method.

3.1 What is Logistic Regression?

It is a generalized linear model used to predict the probability of occurrence
of an event by fitting data to a logistic curve (logit function).

$$\log \frac{p(G = 1 | X = x)}{p(G = 2 | X = x)} = \beta^T X$$

From $\beta^T X$ we can find the probability $p(G = 1 | X = x)$ (the posterior
probability) by using the sigmoid function

$$p(G = 1 | X = x) = \frac{1}{1 + \exp(-\beta^T X)}$$

Why logistic regression rather than linear regression?
Because in linear regression, when X moves far enough along the X-axis, the predicted value falls below 0 or rises above 1.
Such values are theoretically unacceptable as probabilities.
At first glance it is a very simple and easy method, but the problem
is how to find $\beta$.
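
The contrast between a linear output and the sigmoid can be seen with a tiny example. The coefficients `b0` and `b1` below are made up for illustration and are not fitted to any data.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-feature model: score = b0 + b1 * x.
b0, b1 = 0.0, 0.5
for x in (-20.0, 0.0, 20.0):
    score = b0 + b1 * x  # what a linear regression would output directly
    print(x, score, sigmoid(score))
```

For the extreme inputs the raw score lies far outside [0, 1], while the sigmoid of the same score remains a valid probability.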

3.2 How does it work?

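We usually fit logistic regression by maximizing the log-likelihood. To calculate
it we use this equation:

$$L(\beta) = \sum_{i=1}^{n} \left\{ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right\}$$

$$L(\beta) = \sum_{i=1}^{n} \left\{ y_i \beta^T x_i - \log\left(1 + \exp(\beta^T x_i)\right) \right\}$$

where $y_i = 1$ if observation $i$ belongs to class 1 and $y_i = 0$ otherwise. To maximize it we set its first derivative, which is called the score, to zero:

$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{n} x_i (y_i - p_i) = 0$$

To solve this equation we use the Newton-Raphson algorithm, which requires
the second derivative, or Hessian matrix:

$$\frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_{i=1}^{n} x_i x_i^T p_i (1 - p_i)$$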

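With this information, it is clear that there is no closed-form solution
for $\beta$. So we set $\beta^{old}$ to some value (zero is acceptable, unlike in neural
networks), compute $\beta^{new}$, then use $\beta^{new}$ as $\beta^{old}$, and so on.

$$\beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)$$

where $p = (p_1, \ldots, p_n)^T$, $p_i$ is $p(G = 1 | X = x_i)$ (the posterior probability), and

$$W = \begin{pmatrix} p_1(1 - p_1) & 0 & \cdots & 0 \\ 0 & p_2(1 - p_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_n(1 - p_n) \end{pmatrix}$$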
An obvious question: when do we stop?
We start with $\beta = 0$. Knowing that $p(G = 1 | X = x) = \frac{1}{1 + \exp(-\beta^T x)}$,
we can find an initial posterior for all observations, $p = (p_1, \ldots, p_n)^T$ (with $\beta = 0$, every $p_i$ equals 0.5).

With this p vector we can calculate the score function and the Hessian matrix,
divide the first by the second, and find $\beta^{new}$. We then repeat the step.
The algorithm tunes the value of $\beta$ to drive the score toward zero. When the score is close to zero,
$\beta^{new} \approx \beta^{old}$ and we stop: at that point $\beta$ satisfies the model, which makes
a lot of sense.
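
The iteration described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code used for the report: the function name `fit_logistic`, the tolerance, and the iteration cap are assumptions, and the design matrix is assumed to already contain a column of ones for the intercept.

```python
import numpy as np

def fit_logistic(X, y, tol=1e-8, max_iter=50):
    """Newton-Raphson for logistic regression (X includes an intercept column)."""
    beta = np.zeros(X.shape[1])                # beta_old = 0 is a valid start
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # posteriors p_i
        score = X.T @ (y - p)                  # first derivative of L(beta)
        if np.max(np.abs(score)) < tol:        # stop when the score is close to zero
            break
        W = np.diag(p * (1.0 - p))             # diagonal weight matrix
        # beta_new = beta_old + (X^T W X)^{-1} X^T (y - p)
        beta = beta + np.linalg.solve(X.T @ W @ X, score)
    return beta

# Synthetic data drawn from a known model (illustrative values only).
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-0.5, 2.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
final_score = X.T @ (y - p_hat)
```

`np.linalg.solve` is used instead of forming the inverse explicitly; it computes the same update and is numerically safer. Building the full diagonal `W` is fine for a few hundred observations but wasteful for large `n`.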

3.3 Applying to the problem

We train on 300 observations, and from that we found that nine features
are too few. We decided to increase complexity by adding non-linear
combinations of some features, because linear combinations would make the
matrix $X^T W X$ singular. We choose those features according to their correlation with
the response: when the correlation is large, the response is sensitive to
changes in the transformed feature, which makes a lot of sense. In our case,
squaring sbp and typea and multiplying alcohol by age improved
the performance greatly. This choice has no proof or mathematical justification;
it was made by judgment.
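
The point about linear versus non-linear combinations can be checked numerically. In the sketch below the columns are random placeholders that only borrow the feature names from the text (they are not the real measurements), and the extra column `sbp + typea` is a hypothetical example of a linear combination.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300  # number of training observations, as in the text; values are synthetic
sbp = rng.normal(size=n)
typea = rng.normal(size=n)
alcohol = rng.normal(size=n)
age = rng.normal(size=n)

base = np.column_stack([np.ones(n), sbp, typea, alcohol, age])

# Non-linear terms described in the text: squares and one product.
expanded = np.column_stack([base, sbp**2, typea**2, alcohol * age])

# A linear combination of existing columns adds no new direction.
redundant = np.column_stack([base, sbp + typea])

rank_expanded = np.linalg.matrix_rank(expanded)    # full column rank
rank_redundant = np.linalg.matrix_rank(redundant)  # one less than the column count
```

A rank-deficient design matrix makes $X^T W X$ singular, so the Newton-Raphson update cannot be computed, whereas the squared and product terms keep the matrix full rank.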

4 Neural Network performance

4.1 What is Neural Network?

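$$Y_k = \sigma^{(2)}\left( w_{0k}^{(2)} + \sum_{m=1}^{M} w_{mk}^{(2)} \, \sigma^{(1)}\left( w_{m0}^{(1)} + W_m X \right) \right)$$

$$Y_k = \sigma^{(2)}\left( w_{0k}^{(2)} + \sum_{m=1}^{M} w_{mk}^{(2)} Z_m \right), \qquad Z_m = \sigma^{(1)}\left( w_{m0}^{(1)} + W_m X \right)$$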

At first glance it is hard to understand the meaning of a neural network
from those equations. Let's simplify them.
$X$: the data matrix.
$W^{(1)}$ and $W^{(2)}$: the weight matrices.
$M$: the number of neurons.
$\sigma^{(1)}$ and $\sigma^{(2)}$: non-linear functions.
$Y_k$: the output.

The main idea is to project the data onto the weight directions, which lets
us look at the data from different sides, because the data may be easier to
understand from one of those sides.
$\sigma^{(1)}$ is the sigmoid function, which adds flexibility to the model. If we
used a linear function instead of the sigmoid, the outputs for large values on the x-axis would
be unbounded, which is theoretically unacceptable.
$\sigma^{(2)}$ is the identity function or the softmax function, for regression or
classification respectively.
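
A forward pass through such a network can be sketched as below. The function name `forward`, the array shapes, and the choice of four hidden neurons are assumptions made for illustration; the output function defaults to the identity, and a softmax could be passed in its place for classification.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, W1, b1, W2, b2, output=lambda a: a):
    """Single-hidden-layer forward pass.

    X: (n, d) data; W1: (M, d) and b1: (M,) are first-layer weights and biases;
    W2: (K, M) and b2: (K,) are second-layer weights and biases.
    """
    Z = sigmoid(X @ W1.T + b1)      # hidden units Z_m, shape (n, M)
    return output(Z @ W2.T + b2)    # outputs Y_k, shape (n, K)

rng = np.random.default_rng(3)
n, d, M, K = 5, 9, 4, 1             # 9 inputs as in the problem; M and K are arbitrary
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(M, d)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
Y = forward(X, W1, b1, W2, b2)
```

Each row of `W1` is one projection direction: the data are projected onto it, shifted by the bias, and squashed by the sigmoid before the second layer combines the results.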

4.2 How does it work?

We train on some data with some weights $[w_1 \; w_2 \; \ldots \; w_I]$ and test on other data
using the $w$ that minimizes the error, hoping that the training data
are close to the population. The question now is how to find that $w$.
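$$E(w) = \sum_{i=1}^{n} \left( y(x_i, w) - y_i \right)^2$$

$$E(w) = \sum_{i=1}^{n} \left( \sigma^{(2)}\left( w_{0k}^{(2)} + \sum_{m=1}^{M} w_{mk}^{(2)} \, \sigma^{(1)}\left( w_{m0}^{(1)} + W_m x_i \right) \right) - y_i \right)^2$$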

We have two problems with this equation. First, it is very complicated in $w$. The
simplest network, which has 1 input, 1 neuron and 1 output, has 4 weights. So there
is no closed-form solution for the network. Second, the error function
is non-convex, which means local minima exist, whereas we seek the global
one. So we have to solve it numerically.
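
To make the numerical solution concrete, here is a sketch for the simplest network mentioned above (1 input, 1 sigmoid neuron, 1 identity output, 4 weights). Plain gradient descent with a fixed step is used purely for illustration; the report does not state which training algorithm was used, and the target function, starting point, step size, and iteration count are all arbitrary assumptions.

```python
import numpy as np

def predict(w, x):
    """w = [w10, w11, w20, w21]: hidden bias, hidden weight, output bias, output weight."""
    z = 1.0 / (1.0 + np.exp(-(w[0] + w[1] * x)))
    return w[2] + w[3] * z

def error(w, x, y):
    """Sum-of-squares error E(w)."""
    return np.sum((predict(w, x) - y) ** 2)

def gradient(w, x, y):
    """Gradient of E(w) obtained with the chain rule."""
    z = 1.0 / (1.0 + np.exp(-(w[0] + w[1] * x)))
    r = 2.0 * (w[2] + w[3] * z - y)     # dE/d(prediction) per observation
    dz = w[3] * z * (1.0 - z)           # d(prediction)/d(hidden input)
    return np.array([np.sum(r * dz), np.sum(r * dz * x), np.sum(r), np.sum(r * z)])

x = np.linspace(-3.0, 3.0, 50)
y = np.tanh(x)                          # synthetic target, illustration only

w = np.array([0.1, 0.5, 0.0, 0.5])      # arbitrary non-zero starting point
e_start = error(w, x, y)
for _ in range(2000):
    w = w - 0.002 * gradient(w, x, y)   # fixed-step gradient descent
e_end = error(w, x, y)
```

A different starting point can end in a different minimum, which is the practical consequence of the non-convexity noted above; this is one reason training is usually repeated several times.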

4.3 Applying to the problem

The tansig function is used as $\sigma^{(2)}$ instead of softmax, which behaved in a very strange
manner. Training stops when the performance gradient falls below $10^{-4}$.

4.3.1 Before Regularization

Figure 1: Best error before regularization

The best error is in the range between 16.5% and 17% after many tries.

4.3.2 After Regularization

Figure 2:

The best error is in the range between 15.5% and 16.5% after many tries.