
Feature Selection

The nature of feature selection techniques

In large-scale, high-dimensional data, such as the streams produced by IoT sensors or healthcare records with hundreds or thousands of features, we need to figure out which subset of features will yield a good, lasting model.

A raw dataset often comes with many irrelevant features that contribute little to the accuracy of your predictive model.

Consider a music analogy: music engineers employ various techniques to tune a track so that there is no unwanted noise and the vocals are crisp and clear.

Similarly, datasets contain noise, and it is crucial to remove it for better model optimization.

Feature selection

• Reduces overfitting (‘The Curse of Dimensionality’) — If your dataset has more features/columns than samples, the model will be prone to overfitting. By removing irrelevant data/noise, the model gets to focus on essential features, leading to better generalization.
• Simplifies models — Dimensionality adds many layers to a model, making it needlessly complicated. Overengineering is fun, but overengineered models may not perform better than their simpler counterparts. Simpler models are easier to interpret and debug.
• Reduces training time — Fewer features/dimensions reduce the computational cost, speeding up model training.
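The first point above can be illustrated with a minimal variance filter in base R: a constant or near-constant column carries no information, so dropping it shrinks the model at no cost. The variable names and the 1e-6 threshold below are illustrative, not from the original text.

```r
# Toy data: one informative column, two uninformative ones.
set.seed(1)
X <- data.frame(
  useful   = rnorm(100),            # varies with the signal
  constant = rep(1, 100),           # zero variance: no information
  noisy    = rnorm(100, sd = 1e-8)  # near-zero variance
)

# Keep only columns whose variance exceeds a small threshold.
variances <- sapply(X, var)
X_reduced <- X[, variances > 1e-6, drop = FALSE]

names(X_reduced)  # only "useful" survives the filter
```

This is the simplest possible filter method; the GMDH approach described below performs feature selection as part of model building instead.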

Feature Selection and Classification via the GMDH Algorithm in R
The GMDH-type neural network algorithm is a heuristic, self-organizing algorithm for modeling complex systems. This guide covers feature selection and classification via the GMDH algorithm for a binary response.
We will use the GMDH-type neural network approach for feature selection and classification when the response has two classes. Before we start, we need to divide the data into three parts: train, validation, and test sets. We use the train set for model building and the validation set for neuron selection. Last, we evaluate the performance of the model on the test set.

We will implement the algorithm on the breast cancer dataset, also used in the work by Dag et al. (2019), available in the mlbench package (Leisch and Dimitriadou, 2010). Before we go ahead, we load the dataset and preprocess the data.
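A sketch of this loading and preprocessing step, assuming the `BreastCancer` data from mlbench (which stores an `Id` column, nine factor-coded predictors, and a `Class` response, with some rows containing missing values); the object names `bc`, `x`, and `y` are my own:

```r
library(mlbench)
data(BreastCancer)

# Drop the Id column and remove rows with missing values.
bc <- BreastCancer[, -1]
bc <- bc[complete.cases(bc), ]

# Predictors are stored as factors; convert them to a numeric matrix.
x <- data.matrix(bc[, 1:9])
y <- bc$Class   # factor with levels "benign" / "malignant"
```

After removing incomplete cases, 683 of the original 699 observations remain.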
The next step is to define the input and output variables and divide the data into three sets: train (60%), validation (20%), and test (20%). For reproducibility of results, let's fix the seed number to 100. Then, we obtain the number of observations in each fold.
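One way to compute the fold sizes, assuming the 683 complete observations from the preprocessing step (the 60/20/20 split and remainder-to-test convention are one reasonable reading of the text):

```r
n <- 683                       # observations after removing incomplete cases
ntrain <- round(n * 0.6)       # 60% for training  -> 410
nvalid <- round(n * 0.2)       # 20% for validation -> 137
ntest  <- n - ntrain - nvalid  # remainder for testing -> 136

c(train = ntrain, validation = nvalid, test = ntest)
```

Assigning the remainder to the test set guarantees the three sizes always sum to `n` despite rounding.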

Now let's obtain the indices of the train, validation, and test sets. Before doing so, we shuffle the indices to prevent any bias based on the ordering of the rows.
Next, we can construct train, validation and test sets.
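A base-R sketch of the shuffle-and-split step, with the seed of 100 mentioned above and the fold sizes for 683 observations (index object names are hypothetical):

```r
set.seed(100)                  # fix the seed for reproducibility
n <- 683
ntrain <- round(n * 0.6)
nvalid <- round(n * 0.2)

idx <- sample(1:n)             # random permutation removes ordering bias

train.idx <- idx[1:ntrain]
valid.idx <- idx[(ntrain + 1):(ntrain + nvalid)]
test.idx  <- idx[(ntrain + nvalid + 1):n]

# With x and y from the preprocessing step, the three sets are then e.g.:
# x.train <- x[train.idx, ]; y.train <- y[train.idx]
# x.valid <- x[valid.idx, ]; y.valid <- y[valid.idx]
# x.test  <- x[test.idx, ];  y.test  <- y[test.idx]
```

Because the three index vectors partition `idx`, every observation lands in exactly one set.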

After obtaining the train, validation, and test sets, we can apply the GMDH-type neural network algorithm, which is available in the GMDH2 package.
Now, let’s obtain performance measures on test set.
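Putting the steps together, a hedged end-to-end sketch using `GMDH()` and `confMat()` from the GMDH2 package. The argument names and order below follow my reading of the package documentation and Dag et al. (2019) and may differ across package versions; check `?GMDH` and `?confMat` before relying on them.

```r
library(GMDH2)
library(mlbench)

# Preprocess: drop Id, remove incomplete rows, numeric predictors.
data(BreastCancer)
bc <- BreastCancer[complete.cases(BreastCancer), -1]
x <- data.matrix(bc[, 1:9])
y <- bc$Class

# 60/20/20 split with seed 100, as described above.
set.seed(100)
n <- nrow(x)
ntrain <- round(n * 0.6)
nvalid <- round(n * 0.2)
idx <- sample(1:n)
ti <- idx[1:ntrain]
vi <- idx[(ntrain + 1):(ntrain + nvalid)]
si <- idx[(ntrain + nvalid + 1):n]

# Fit: train set builds the model, validation set selects neurons.
model <- GMDH(x[ti, ], y[ti], x[vi, ], y[vi])

# Predict classes on the held-out test set and summarize performance
# (accuracy, sensitivity, specificity, and related measures).
pred <- predict(model, x[si, ], type = "class")
confMat(pred, y[si], positive = "malignant")
```

Your exact performance numbers may differ slightly from those reported below depending on the package version and random number generator.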
Based on the output above, the accuracy of the GMDH algorithm is estimated to be 0.9485; that is, the algorithm classifies 94.85% of persons into the correct class. Sensitivity and specificity are calculated as 0.8913 and 0.9778: the algorithm correctly classifies 89.13% of the persons having breast cancer and 97.78% of the persons not having breast cancer.
