
SA3 Assignment 1

V S Prasannakumar Mamidala
CBA Batch 3 Section B
Student ID: 71420100

1. Why do count data need to be modeled differently from the standard linear regression model?

Linear regression fitted by Ordinary Least Squares (OLS) treats the response
as a continuous variable that can take any value, positive or negative,
integer or fractional, and assumes normally distributed errors for inference.
Count data take only non-negative integer values, so these assumptions do
not hold.

Also, as the professor mentioned in class, OLS assumes that the data are a
random sample from the population, that the errors are statistically
independent of one another, that the error variance is constant
(homoscedasticity), and that the errors are normally distributed (for
purposes of hypothesis testing). These assumptions may not hold for count
data, whose variance typically grows with the mean.

Count data are usually better described by a Poisson distribution than by
the normal distribution assumed in OLS regression.
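
The contrast between the two models can be written side by side. This is a generic sketch with a single predictor x_i; the symbols \beta_0, \beta_1, \sigma^2 and \mu_i are illustrative and are not taken from the assignment data.

```latex
% OLS: continuous response, constant error variance
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2)

% Poisson regression: non-negative integer response, variance equal to the mean
Y_i \sim \mathrm{Poisson}(\mu_i),
\qquad \log \mu_i = \beta_0 + \beta_1 x_i,
\qquad \mathrm{Var}(Y_i) = \mu_i
```

The log link keeps the fitted mean positive, which a linear predictor on the raw scale cannot guarantee.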

2(a) Analyze data to study the effects of Drug and physician Age on the number
of prescriptions. Also, study the effect of interaction between Drug & Age.
(i) Poisson Model with both drug & age
Log(E[Pres]) = D0 + D1 * (Drug B) + D2 * (Age O) + D3 * (Age Y)
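
Exponentiating the equation may make the coefficients easier to read. The form below is a sketch that writes the intercept as D_0 and treats (Drug B), (Age O) and (Age Y) as 0/1 indicator variables, which is an assumption about how the factors were coded.

```latex
E[\mathrm{Pres}]
  = e^{D_0}
    \cdot \left(e^{D_1}\right)^{\mathrm{Drug\,B}}
    \cdot \left(e^{D_2}\right)^{\mathrm{Age\,O}}
    \cdot \left(e^{D_3}\right)^{\mathrm{Age\,Y}}
```

Under this coding, e^{D_0} is the expected number of prescriptions at the baseline levels, and e^{D_1} is the multiplicative change in the expected count for Drug B relative to the baseline drug, holding Age fixed.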

(ii) Poisson models: NULL model, Drug only and Age only (noting the residual deviance of each)

As explained in tutorial 1, including each variable in the Poisson equation
improves the fit: as the number of parameters in the model increases, the
residual deviance decreases, so the model with all variables fits the data
better than the other models. If the model fits the data exactly, the
residual deviance is zero.

In this case, the goodness-of-fit test gives p = 1.46e-21, so the Poisson
model does not fit the data adequately. Note AIC = 493.

Since the residual deviance is greater than its degrees of freedom, there is
overdispersion in the data.

Let us fit the Negative Binomial (NB) regression and see whether the NB model
fits the data better than the Poisson model.
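
The overdispersion check described above can be sketched in a few lines. The counts and fitted means below are hypothetical (an intercept-only fit, so every fitted mean equals the sample mean); they are not the assignment data, and the function name `poisson_deviance` is illustrative.

```python
import math

def poisson_deviance(observed, fitted):
    """Residual deviance of a Poisson model: 2 * sum(y*log(y/mu) - (y - mu))."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        # The y*log(y/mu) term is taken as 0 when y == 0 (its limiting value).
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2.0 * total

# Hypothetical counts, for illustration only.
observed = [2, 9, 1, 14, 4, 0]
# Intercept-only model: the fitted mean is the sample mean for every row.
fitted = [sum(observed) / len(observed)] * len(observed)
n_params = 1

deviance = poisson_deviance(observed, fitted)
df_resid = len(observed) - n_params
dispersion = deviance / df_resid

print(f"deviance={deviance:.2f}, df={df_resid}, ratio={dispersion:.2f}")
if dispersion > 1:
    print("Residual deviance exceeds its degrees of freedom: possible overdispersion.")
```

A ratio close to 1 is what a well-fitting Poisson model would give; a ratio well above 1 is the pattern that motivates trying the negative binomial model.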

(iii) Negative Binomial Regression Model

As can be seen from the R outputs, as the number of parameters in the model
increases, the residual deviance decreases, so the model with all variables
fits the data better than the other models. If the model fits the data
exactly, the residual deviance is zero.

In this case, the goodness-of-fit test gives p = 0.33, so there is no
evidence of lack of fit. Note AIC = 53.

Comparison:
Residual Deviance (NB) < Residual Deviance (Poisson)
AIC (NB) < AIC (Poisson)
From this we can say that, compared to Poisson regression,
negative binomial regression fits the data better.
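
For reference, the two quantities behind this comparison can be written out. This is a sketch using one common parameterization of the negative binomial (with dispersion parameter \theta); the symbols k and \hat{L} are introduced here for illustration and do not appear in the R output.

```latex
% Negative binomial variance: the extra term absorbs overdispersion
\mathrm{Var}(Y) = \mu + \frac{\mu^2}{\theta}
\qquad \text{(Poisson is the limit } \theta \to \infty \text{, where } \mathrm{Var}(Y) = \mu\text{)}

% Akaike information criterion: k = number of estimated parameters,
% \hat{L} = maximized likelihood; lower is better
\mathrm{AIC} = 2k - 2 \log \hat{L}
```

The NB model spends one additional parameter on \theta, so a lower AIC means the gain in likelihood more than offsets that penalty.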

(b) Suppose that there are two missing values in the above data in the following way:
one value missing in the second column, 5th row, and another value missing in the third column, 8th row.
Apply a suitable regression technique to impute the missing values. (15 marks)

From the earlier analysis, we know the negative binomial regression
equation, which we can use to compute the missing value of Prescriptions
in row 5, given that the values of Drug and Age are known.
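
A minimal sketch of this kind of regression imputation is shown below. The coefficients in `coef`, the example row and the helper `predict_count` are hypothetical and chosen only for illustration; they are not the fitted values from the assignment, and any Age level other than O or Y is assumed to be the baseline.

```python
import math

# Hypothetical coefficients on the log scale (NOT the fitted values).
coef = {"intercept": 3.0, "drug_B": 0.4, "age_O": -0.2, "age_Y": 0.3}

def predict_count(drug, age, coef):
    """Expected count from a log-link model with dummy-coded Drug and Age."""
    eta = coef["intercept"]
    if drug == "B":
        eta += coef["drug_B"]
    if age == "O":
        eta += coef["age_O"]
    elif age == "Y":
        eta += coef["age_Y"]
    # Any other level of drug or age is treated as the baseline (assumption).
    return math.exp(eta)

# Hypothetical row whose prescription count is missing.
row = {"drug": "A", "age": "Y", "prescriptions": None}
if row["prescriptions"] is None:
    # Round the expected count so the imputed value is an integer.
    row["prescriptions"] = round(predict_count(row["drug"], row["age"], coef))

print(row)
```

Rounding is a design choice here: the model returns an expected count, which need not be an integer, while the imputed cell is a count.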

As seen above, the imputed value for Prescriptions in row 5 is 39.
Similarly, using the same equation, we can calculate the missing value of
Drug to be A. The same value is also obtained using the K-nearest
neighbor (KNN) technique.
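
The KNN cross-check can be sketched as a majority vote among the closest complete rows. Everything below is an assumption made for illustration: the rows are hypothetical, the mixed distance (scaled gap in prescriptions plus a penalty for a different Age) is one of many possible choices, and `knn_impute_label` is not a library function.

```python
from collections import Counter

def knn_impute_label(target, rows, k=3):
    """Majority vote of Drug among the k complete rows closest to `target`."""
    def distance(a, b):
        # Mixed distance (an assumption): scaled gap in counts, plus 1 if ages differ.
        return abs(a["pres"] - b["pres"]) / 10.0 + (a["age"] != b["age"])

    nearest = sorted(rows, key=lambda r: distance(target, r))[:k]
    return Counter(r["drug"] for r in nearest).most_common(1)[0][0]

# Hypothetical complete rows, for illustration only.
rows = [
    {"drug": "A", "age": "Y", "pres": 30},
    {"drug": "A", "age": "Y", "pres": 34},
    {"drug": "B", "age": "Y", "pres": 55},
    {"drug": "B", "age": "O", "pres": 50},
    {"drug": "A", "age": "O", "pres": 28},
]

# Hypothetical row whose Drug value is missing.
target = {"drug": None, "age": "Y", "pres": 32}
target["drug"] = knn_impute_label(target, rows)
print(target)
```

With real data, the choice of k and of the distance scaling would need to be justified, since both can change which rows count as nearest.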