
INTRODUCTION

The Abalone dataset is a collection of physical measurements of abalones, comprising 4,177 examples. To demonstrate the algorithms in action, we'll use this previously collected dataset. With this data, we can build a number of regression models to investigate how the independent variables affect our dependent variable, Rings. Knowing how each factor influences an abalone's age can help oceanographers, jewelers, and businesses better examine their production, distribution, and pricing strategies. To understand the data, you must first understand what it contains: the type (continuous numeric, discrete numeric, or categorical) and meaning of each feature, and the number of instances and features in the dataset.

This dataset comes from an original (non-machine-learning) study and was donated in December 1995.

VARIABLES

• Sex: The sex of the abalone; a categorical value (M, F, or I for infant).
• Length: The longest measurement of the abalone shell, in mm. Continuous numeric value.
• Diameter: The measurement of the abalone shell perpendicular to length, in mm. Continuous numeric value.
• Height: The height of the shell, in mm. Continuous numeric value.
• Whole Weight: The weight of the whole abalone, in grams. Continuous numeric value.
• Shucked Weight: The weight of just the meat, in grams. Continuous numeric value.
• Viscera Weight: The gut weight after bleeding, in grams. Continuous numeric value.
• Shell Weight: The weight of the shell after being dried, in grams. Continuous numeric value.
• Rings: The target feature that we will train the model to predict. As mentioned earlier, we are interested in the age of the abalone, and it has been established that the number of rings + 1.5 gives the age in years. Discrete numeric value.
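
A minimal sketch of loading and inspecting the data with pandas (the URL and column names below follow the standard UCI distribution of this dataset and are an assumption, not something stated in this report):

    import pandas as pd

    # Load the Abalone data from the UCI repository (assumed layout:
    # no header row, nine columns in the order listed above).
    url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
           "abalone/abalone.data")
    columns = ["Sex", "Length", "Diameter", "Height", "Whole_weight",
               "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings"]
    df = pd.read_csv(url, header=None, names=columns)

    print(df.shape)   # expected: (4177, 9)
    print(df.dtypes)  # confirm feature types
    df["Age"] = df["Rings"] + 1.5  # age in years, per the rule above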

PAIRPLOT

A pair plot is used to find the best set of features to explain a relationship between two variables or to form the most separated clusters.
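
A minimal sketch of producing such a plot with seaborn, assuming the DataFrame df from the loading sketch above:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Pair plot of all numeric features, coloured by Sex, to reveal
    # pairwise relationships and how well the classes separate.
    sns.pairplot(df, hue="Sex", corner=True)
    plt.show()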

Observations from the pair plot:

• The first thing to note is the high correlation in the data: there appears to be strong multicollinearity among the predictors. For example, the correlation between Diameter and Length is extremely high (about 0.987).
• Similarly, Whole_weight is highly correlated with the other weight predictors; it is the sum of Shucked_weight, Viscera_weight, and Shell_weight.
• The distributions of the predictor Sex for the female and male factor levels are very similar with respect to all other predictors.
• The shape of the distribution is also markedly similar for the female and male factor levels.
• We could therefore redefine this feature as infant vs non-infant (where non-infant covers both female and male); a sketch of this recoding follows this list.
• Most abalones have between 5 and 15 rings.
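
As a small sketch of the correlation check and the recoding suggested above (the Infant column name is our own illustrative choice):

    # Confirm the multicollinearity noted above.
    print(df["Diameter"].corr(df["Length"]))  # roughly 0.987

    # Redefine Sex as infant vs non-infant.
    df["Infant"] = (df["Sex"] == "I").astype(int)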

REGRESSION DIAGNOSTIC PLOTS

1. RESIDUALS VS FITTED

The first plot tests for linearity and heteroskedasticity. Based on this plot, we can see that the residuals are heteroskedastic, which indicates a dependency between the residuals and the fitted values.

2. NORMAL Q-Q

The second plot tests for the normality of the residuals. This plot shows that our residuals are heavy-tailed.

3. SCALE-LOCATION

The third plot also tests for heteroskedasticity. Like plot 1, the residuals are
not evenly spread across the line.

4. RESIDUALS VS LEVERAGE

This plot helps us find whether any specific data points are influencing the model. The two points that are leveraging our outcome are observations 1418 and 2052.
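
These four diagnostics are the classic panels produced by R's plot(lm). A rough Python sketch of the first two, assuming the df from earlier and an illustrative model formula (the report does not state which predictors were used), might look like:

    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import matplotlib.pyplot as plt

    # Fit an OLS model of Rings on some physical measurements.
    # The formula is illustrative, not the report's actual model.
    model = smf.ols("Rings ~ Length + Diameter + Height + Whole_weight",
                    data=df).fit()

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # 1. Residuals vs fitted: a funnel shape signals heteroskedasticity.
    ax1.scatter(model.fittedvalues, model.resid, s=8)
    ax1.axhline(0, color="grey", linestyle="--")
    ax1.set_xlabel("Fitted values")
    ax1.set_ylabel("Residuals")

    # 2. Normal Q-Q: heavy tails show as deviations at the extremes.
    sm.qqplot(model.resid, line="45", fit=True, ax=ax2)
    plt.tight_layout()
    plt.show()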

CLASSIFICATION

We'll use four classifiers to classify the data: random forest, decision tree, KNN, and SVM. We'll also figure out which parameters are best for each classifier. Rather than cross-validation, we use a simple grid search strategy to find the optimal parameters for each classifier.
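
A minimal sketch of such a grid search with scikit-learn, assuming the df from earlier and, purely for illustration, Sex as the class label (the report does not name the target here):

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Hold out a validation set and search over k by hand
    # (no cross-validation, matching the strategy described above).
    X = df[["Length", "Diameter", "Height", "Whole_weight",
            "Shucked_weight", "Viscera_weight", "Shell_weight"]]
    y = df["Sex"]
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0)

    best_k, best_acc = None, 0.0
    for k in range(1, 31):
        acc = KNeighborsClassifier(n_neighbors=k).fit(
            X_train, y_train).score(X_val, y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    print(best_k, best_acc)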
RANDOM FOREST

Random Forest is an ensemble learning technique that creates a large number of decision trees during training. For classification problems, it predicts the mode of the classes; for regression tasks, it predicts the mean of the individual trees' predictions. During tree construction, it employs the random subspace approach and bagging, and it comes with a built-in feature importance indicator.
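
A short sketch of fitting a random forest and reading its built-in feature importance indicator, reusing X_train/X_val from the grid-search sketch above (the hyperparameters are illustrative, not tuned results from this report):

    from sklearn.ensemble import RandomForestClassifier

    # An ensemble of bagged trees with a built-in importance measure.
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)
    print("validation accuracy:", rf.score(X_val, y_val))

    # Per-feature importances, aggregated across all trees.
    for name, imp in zip(X.columns, rf.feature_importances_):
        print(f"{name}: {imp:.3f}")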

K-NN

KNN is a supervised learning algorithm that predicts the output for data points using a labelled input data set. It is one of the most basic machine learning algorithms, and it can be used to solve a wide range of problems. It is based primarily on feature similarity: KNN compares a data point to its neighbours and assigns it to the most similar class. KNN is a non-parametric model, which means that, unlike most algorithms, it makes no assumptions about the data set; this makes it more effective on realistic data. KNN is also a lazy algorithm: instead of learning a discriminative function from the training data, it memorises the training data. Both classification and regression problems can be solved with KNN.
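
A minimal KNN sketch under the same assumptions; the scaling step is our own addition (distance-based methods are sensitive to feature scale), and n_neighbors=5 is a placeholder rather than a tuned value:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Scale features, then classify each point by its 5 nearest neighbours.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    knn.fit(X_train, y_train)
    print("validation accuracy:", knn.score(X_val, y_val))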
SVM

Support vector machines (SVMs) are supervised learning models with associated learning algorithms for classification and regression analysis. They are primarily used to solve classification problems. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. The hyperplane that best separates the two classes is then used to classify the data. In addition to linear classification, SVMs can also perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces.
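
A minimal sketch of the non-linear (kernelised) case described above, under the same assumptions; C and gamma are illustrative defaults, not tuned values:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # An RBF-kernel SVM: the kernel implicitly maps inputs into a
    # high-dimensional feature space before finding the hyperplane.
    svm = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=1.0, gamma="scale"))
    svm.fit(X_train, y_train)
    print("validation accuracy:", svm.score(X_val, y_val))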

DECISION TREE

In machine learning, a decision tree is a supervised method. It assigns a target value to each data sample using a binary tree graph (each node has two children); the tree's leaves represent the target values. Starting at the root node, a sample is propagated through the nodes until it reaches a leaf. In each node, a choice is made about which child node the sample should travel to, based on a single feature of the sample (one feature is used per node to make the decision). The process of discovering the best rule at each internal tree node, based on a chosen metric, is known as decision tree learning.
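
A minimal decision tree sketch under the same assumptions; max_depth=3 is an illustrative choice that keeps the printed rules readable:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # A shallow binary tree: one feature tested per internal node.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_train, y_train)
    print("validation accuracy:", tree.score(X_val, y_val))

    # Print the learned per-node rules described above.
    print(export_text(tree, feature_names=list(X.columns)))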
