
INTRODUCTION & OBJECTIVES
❖ What is abalone?

❖ Objective: Our data set contains several physical measurements
of abalones. Our task was to build a linear regression model that
predicts their age from those measurements.

VARIABLES & LITERATURE REVIEW…
❖ Our response variable is Age, which is equal to Rings + 1.5.
❖ The predictor variables are:
▪ Categorical: Sex
▪ Numerical:
▪ Length, Diameter, Height
▪ Whole weight, Shucked weight, Viscera weight, Shell weight
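
As a minimal sketch of the variable set-up above: Age is derived from Rings, and Sex is dummy-coded. The helper names and the example rows are hypothetical, and treating F as the reference level is an assumption (it matches the SexI and SexM terms that appear in the final model).

```python
def age_from_rings(rings):
    """Age in years, using the slides' definition Age = Rings + 1.5."""
    return rings + 1.5


def sex_dummies(sex):
    """Dummy-code Sex as (SexI, SexM); F is the reference level (assumption)."""
    return (int(sex == "I"), int(sex == "M"))


# Hypothetical example rows of (Sex, Rings), not taken from the data set.
rows = [("M", 15), ("F", 9), ("I", 7)]
encoded = [(sex_dummies(sex), age_from_rings(rings)) for sex, rings in rows]
```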

VARIABLES & LITERATURE REVIEW…
❖ A number of analyses have already been carried out on abalone
age prediction. Before starting our model-building process, we
reviewed them thoroughly to identify the techniques they used to
obtain a high-performance model, and we decided to incorporate
those techniques in our model. Our model differs from the rest in
that we have considered interactions.

DESCRIPTIVE ANALYSIS
UNIVARIATE ANALYSIS

• The response variable Age appears to follow a roughly normal
distribution with a mean around 10, with a slight positive skew.
• The explanatory variables Length and Diameter both appear to
follow a negatively skewed, bell-shaped distribution.
• Shucked Weight, Viscera Weight and Shell Weight seem to follow
an exponential distribution.
• Whole Weight seems to follow a multimodal exponential
distribution, as it also includes the other Weight variables.
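
The skew directions described above can be checked numerically. The sketch below uses the moment-based skewness coefficient; the function name and the toy samples are hypothetical and are not the abalone data.

```python
from statistics import mean


def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical toy samples.
right_tailed = [1, 1, 2, 2, 3, 9]   # long upper tail -> positive skew
left_tailed = [1, 7, 8, 8, 9, 9]    # long lower tail -> negative skew
symmetric = [1, 2, 3, 4, 5]         # no skew
```

A positive value corresponds to the "positively skewed" shape reported for Age, a negative value to the shape reported for Length and Diameter.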

• Most of the abalones have an age between 5 and 15. A
significant number of outliers can be seen between ages 15
and 25, and a couple of extreme outliers can be identified
near age 30.
• Two extreme outliers can be seen in the independent
variable Height, and the boxplot also shows its small spread.
• The independent variables Length and Diameter have only
lower-bound outliers, while all the Weight variables have
upper-bound outliers.
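
Boxplot outliers are commonly flagged with the 1.5 × IQR rule. The sketch below assumes that rule (plotting software may compute quartiles slightly differently); the function name and the sample values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower, upper) outliers using the k * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    lower = [v for v in values if v < low_fence]
    upper = [v for v in values if v > high_fence]
    return lower, upper


# Hypothetical ages with one extreme value near 30.
ages = [5, 6, 7, 8, 9, 10, 11, 12, 30]
lower_out, upper_out = iqr_outliers(ages)
```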

The sample contains more males than any other group, while females and infants are almost equal in number.
The average percentage of whole weight made up by each weight type is shown here.

Most of the whole weight consists of shucked weight, followed by shell weight.

The smallest percentage is unidentified weight, which does not belong to any of the weight types
recorded in the data set.
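
A minimal sketch of how such percentages can be computed for a single abalone, with the unidentified share taken as the remainder. The function name and the example weights are hypothetical; the slide's figures would be averages of these shares over the whole sample.

```python
def weight_shares(whole, shucked, viscera, shell):
    """Percentage of whole weight per component; the rest is 'Unidentified'."""
    parts = {"Shucked": shucked, "Viscera": viscera, "Shell": shell}
    shares = {name: 100.0 * w / whole for name, w in parts.items()}
    shares["Unidentified"] = 100.0 - sum(shares.values())
    return shares


# Hypothetical weights for one abalone.
shares = weight_shares(whole=0.50, shucked=0.20, viscera=0.10, shell=0.15)
```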
BIVARIATE ANALYSIS
• Every independent variable appears to be positively
correlated with the response variable Age.
• A clear correlation pattern can be observed between Age
and the Weight variables, and another distinct pattern
between Age and the independent variables Length and
Diameter.

• The heat map shows the magnitude of the correlations
between the variables.
• There appear to be many high correlations between
independent variables, indicating multicollinearity in
the model.
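
The numbers behind such a heat map are pairwise Pearson correlations. The sketch below computes them for a few predictors and lists the highly correlated pairs; the toy measurements and the 0.9 cut-off are assumptions chosen for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy measurements, not the abalone data.
data = {
    "Length": [0.30, 0.40, 0.50, 0.60, 0.70],
    "Diameter": [0.23, 0.31, 0.40, 0.47, 0.55],
    "Height": [0.08, 0.10, 0.15, 0.14, 0.19],
}
names = list(data)
corr = {(a, b): pearson(data[a], data[b]) for a in names for b in names}

# Pairs of predictors whose correlation exceeds the assumed cut-off.
high_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(corr[(a, b)]) > 0.9
]
```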

• The violin plots show the distribution and spread of
every variable for each sex.
• For most of the variables, Males and Females seem to
follow nearly identical distributions; only Infants differ.
This can be clearly seen in all the Weight variables.

• Here each continuous independent variable
is plotted against the response variable Age,
grouped by the categorical variable Sex.
• In every plot, the smallest values are
mostly occupied by the Infant category.
• A clear correlation with Age can be seen for
Length, Diameter and Shell Weight. In the plot
of Age against Height, a definite positive
correlation can be observed in the points, but
the fitted line deviates from it because of
an extreme outlier.
• Shucked Weight and Viscera Weight appear
to show a weaker correlation with Age
than the others.

FITTING THE MODEL

❖ Our response variable was Age, which is equal to Rings + 1.5.
❖ We added the variable 'age' to the original dataset, because
our objective was to predict the age of abalones.
❖ Let's walk through the procedure that we used to build
our model to predict the age of the abalone.

FITTING THE MODEL…
FORWARD, BACKWARD OR STEPWISE?
❖ First, we compared the model-building methods (forward
selection, backward elimination and stepwise selection) to find
out which one is best.
❖ We got the same final model regardless of the method used.
❖ All the methods suggested removing the "Length" variable:
in the ANOVA test, Length was the only variable that was
insignificant to the model.
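
The decision to drop a single variable rests on comparing two nested models. The sketch below computes the partial F statistic for such a comparison with NumPy; the function names and the synthetic data are hypothetical, and it illustrates the test only, not the selection procedure actually run for the slides.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of an OLS fit that includes an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)


def partial_f(X_full, X_reduced, y):
    """F statistic for H0: the columns dropped from X_full have zero effect."""
    n = len(y)
    p_full = X_full.shape[1] + 1                  # +1 for the intercept
    q = X_full.shape[1] - X_reduced.shape[1]      # number of dropped columns
    rss_full, rss_red = rss(X_full, y), rss(X_reduced, y)
    return ((rss_red - rss_full) / q) / (rss_full / (n - p_full))


# Synthetic data: y depends on x1 only, x2 is pure noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.5, size=200)

full = np.column_stack([x1, x2])
f_drop_x2 = partial_f(full, x1[:, None], y)   # small: x2 adds little
f_drop_x1 = partial_f(full, x2[:, None], y)   # large: x1 matters
```

A small F statistic for a dropped variable is what marks it as insignificant, which is the situation the slides report for Length.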

FITTING THE MODEL…
MULTICOLLINEARITY
❖ After fitting the model with main effects only, we got an
R-squared value of just 0.53.
❖ At this point we decided to check the variables in the model
for multicollinearity.
❖ Although "Whole weight" has a large VIF value, the ANOVA test
comparing the fitted model with the model without Whole weight
showed that the fitted model is more adequate.
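
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below implements this with NumPy on synthetic data in which a "whole" column is nearly the sum of two parts; all names and values are hypothetical.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X."""
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


# Synthetic predictors: "whole" is almost shucked + shell.
rng = np.random.default_rng(1)
shucked = rng.uniform(0.1, 1.0, size=300)
shell = rng.uniform(0.1, 0.6, size=300)
height = rng.uniform(0.05, 0.25, size=300)
whole = shucked + shell + rng.normal(scale=0.02, size=300)

vifs = vif(np.column_stack([whole, shucked, shell, height]))
```

In this toy setting the first entry of `vifs` (for "whole") is very large while the last (for "height") stays close to 1, which mirrors why a total-weight variable tends to show a big VIF.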

FITTING THE MODEL…
LOW R SQUARED?

❖ Since our model had a low R-squared value, we decided to add
two-way interaction terms.
❖ Following the backward elimination method, we got the final
model with two-way interactions.
❖ But the R-squared value of our model was still low (0.59).
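
A two-way interaction term is simply the product of two predictor columns added to the design matrix. The sketch below compares R-squared with and without such a term on synthetic data; the names and the data-generating model are assumptions made for illustration.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of an OLS fit that includes an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Synthetic data in which the response truly depends on a * b.
rng = np.random.default_rng(2)
a = rng.uniform(0, 1, size=400)
b = rng.uniform(0, 1, size=400)
y = 1 + a + b + 4 * a * b + rng.normal(scale=0.2, size=400)

r2_main = r_squared(np.column_stack([a, b]), y)           # main effects only
r2_inter = r_squared(np.column_stack([a, b, a * b]), y)   # plus a:b interaction
```

Adding a column can never lower R-squared on the training data, so the size of the gain is what matters.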

FITTING THE MODEL…
TRANSFORMATIONS
❖ Finally, we decided to apply a log transformation to our
response variable.
❖ It improved the R-squared value to 0.64!
❖ So, we got our final model as:
Log(age) = 1.39 + 1.96(Shell.Weight) – 3.67(Shucked.Weight) + 2.7(Diameter)
  + 1.52(Whole.Weight) – 0.13(SexI) – 0.05(SexM) + 5.38(Height)
  – 2.76(Viscera.Weight) – 2.3(Shell.Weight * Shucked.Weight)
  + 6.57(Shucked.Weight * Diameter) – 8.83(Shucked.Weight * Height)
  + 1(Shucked.Weight * Whole.Weight) + 1.09(Shucked.Weight * SexI)
  + 0.07(Shucked.Weight * SexM) – 4.27(Diameter * Whole.Weight)
  – 0.722(Diameter * SexI) – 0.14(Diameter * SexM)
  – 15.74(Diameter * Height) + 6.97(Diameter * Viscera.Weight)
  – 0.26(Whole.Weight * SexI) – 0.02(Whole.Weight * SexM)
  + 5.66(Whole.Weight * Height) – 0.93(Whole.Weight * Viscera.Weight)
  + 1.54(SexI * Height) + 0.7(SexM * Height)

Residual standard error: 0.1633 on 4151 degrees of freedom
Multiple R-squared: 0.6433, Adjusted R-squared: 0.6411
F-statistic: 299.4 on 25 and 4151 DF, p-value: < 2.2e-16
MODEL PARAMETER INTERPRETATION


❖ We used the natural log transformation for our response variable as it
substantially improved our model performance. Hence, to get the predicted value
for age we have to take the exponential of the model output.
❖ Because of the large number of interaction terms in our model, it is difficult
to give a comprehensive explanation of how each variable contributes to changes
in our response variable. So instead of interpreting all of them, we interpret
the parameters of two variables that are involved in few interaction terms;
the others can be interpreted in the same way.
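
The back-transformation from the model output to an age can be sketched as follows. The function names and the example value are hypothetical; the conversion to rings uses the definition Age = Rings + 1.5 stated earlier in the slides.

```python
import math


def predicted_age(log_age_hat):
    """Undo the natural log transform: age = exp(model output)."""
    return math.exp(log_age_hat)


def predicted_rings(log_age_hat):
    """Convert the predicted age back to rings via Age = Rings + 1.5."""
    return predicted_age(log_age_hat) - 1.5


# Hypothetical model output on the log scale.
example_age = predicted_age(math.log(10.0))
```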

MODEL PARAMETER INTERPRETATION

❖ Shell Weight
▪ Age is expected to change by a factor of
exp(1.96 - 2.3*Shucked.weight), for a given Shucked.weight, as a
result of a one-unit increase in "Shell.weight".

❖ Viscera Weight
▪ Age is expected to change by a factor of
exp(-2.76 + 6.97*Diameter - 0.93*Whole.weight), for given
Whole.weight and Diameter, as a result of a one-unit increase in
"Viscera.weight".
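
These factors can be evaluated directly from the final model's coefficients. In the sketch below the coefficient values are copied from the final model equation, while the function names and the input values used in the checks are hypothetical.

```python
import math

# Coefficients taken from the final model equation.
B_SHELL = 1.96
B_SHELL_X_SHUCKED = -2.3
B_VISCERA = -2.76
B_DIAMETER_X_VISCERA = 6.97
B_WHOLE_X_VISCERA = -0.93


def shell_weight_factor(shucked_weight):
    """Multiplicative change in age per one-unit increase in Shell.weight."""
    return math.exp(B_SHELL + B_SHELL_X_SHUCKED * shucked_weight)


def viscera_weight_factor(diameter, whole_weight):
    """Multiplicative change in age per one-unit increase in Viscera.weight."""
    return math.exp(
        B_VISCERA
        + B_DIAMETER_X_VISCERA * diameter
        + B_WHOLE_X_VISCERA * whole_weight
    )
```

Because the slope depends on the other variables in the interaction, the factor has to be evaluated at specific values of Shucked.weight, or of Diameter and Whole.weight.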
MODEL PARAMETER INTERPRETATION

❖ There is another important point in the model parameter
interpretation: when we change a variable that interacts with
our categorical variable Sex, we have to account for all three
cases (infant, male, female).
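
As an illustration, the slope of log(age) with respect to Diameter differs by sex because of the Diameter * SexI and Diameter * SexM terms. The sketch below uses the final model's coefficients; the function name and the default input values are hypothetical, and taking female as the reference level is an assumption based on only SexI and SexM appearing in the model.

```python
def diameter_slope(sex, shucked=0.3, whole=0.8, height=0.14, viscera=0.18):
    """Change in log(age) per unit of Diameter, for sex in {'F', 'I', 'M'}."""
    slope = (
        2.7
        + 6.57 * shucked
        - 4.27 * whole
        - 15.74 * height
        + 6.97 * viscera
    )
    # Sex-specific adjustments; female is the assumed reference level.
    sex_shift = {"F": 0.0, "I": -0.722, "M": -0.14}
    return slope + sex_shift[sex]


slopes = {sex: diameter_slope(sex) for sex in ("F", "I", "M")}
```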

RESIDUAL ANALYSIS

❖ As the plot illustrates, the following assumptions of our
model appear to hold:
▪ Homoscedasticity (constant variance)
▪ Linearity (no non-linear patterns)
▪ Independence of observations
▪ Residuals are randomly scattered
around the zero line in a horizontal
band.

RESIDUALS VS FITTED VALUES PLOT


RESIDUAL ANALYSIS…

❖ This Q-Q plot attests that our residuals approximately
follow a normal distribution.

Q-Q PLOT
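
The points of a normal Q-Q plot can be computed as in the sketch below. The function name and the toy residuals are hypothetical, and the plotting positions (i - 0.5) / n are one common convention assumed here.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    sample = sorted((r - m) / s for r in residuals)   # standardized residuals
    theoretical = [
        NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)
    ]
    return list(zip(theoretical, sample))


# Hypothetical residuals, not the model's actual residuals.
points = qq_points([-0.2, -0.1, 0.0, 0.1, 0.2])
```

If the residuals are approximately normal, these points fall close to the line y = x.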
