In large-scale, high-dimensional data, such as data coming from IoT sensors or healthcare applications with hundreds or thousands of features, we need to figure out which subset of features will yield a good, robust model.
A ‘raw’ dataset often comes with many irrelevant features that do not contribute much to the accuracy of your predictive model.
As an analogy, music engineers often employ various techniques to tune their recordings so that there is no unwanted noise and the voice is crisp and clear. Datasets likewise contain noise, and it is crucial to remove it for better model optimization.
Feature selection
• Reduces overfitting (‘The Curse of Dimensionality’) — if your dataset has more features/columns than samples, the model will be prone to overfitting. By removing irrelevant features and noise, the model gets to focus on the essential features, leading to better generalization.
• Simplifies models — high dimensionality adds many layers to a model, making it needlessly complicated. Over-engineering can be fun, but over-engineered models are often no better than their simpler counterparts, and simpler models are easier to interpret and debug.
• Reduces training time — fewer features/dimensions reduce the computational load, speeding up model training.
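As a minimal sketch in R (with invented toy data; the variance threshold and correlation ranking are illustrative choices, not a prescribed method), a simple filter-style feature selection might look like this:

```r
# Toy data: 100 samples, 5 features; only x1 and x2 carry real signal
set.seed(1)
x <- data.frame(x1 = rnorm(100), x2 = rnorm(100),
                x3 = rnorm(100), x4 = rnorm(100),
                x5 = rep(1, 100))            # a (near) zero-variance feature
y <- x$x1 + 2 * x$x2 + rnorm(100, sd = 0.1)

# Step 1: drop features with (near) zero variance -- they carry no information
keep <- sapply(x, function(col) sd(col) > 1e-8)
x <- x[, keep]

# Step 2: rank the remaining features by absolute correlation with the target
cors <- sapply(x, function(col) abs(cor(col, y)))
sort(cors, decreasing = TRUE)   # x2 and x1 should dominate the ranking
```

A model trained only on the top-ranked features is smaller, faster to fit, and less prone to chasing noise.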
We will implement the algorithm on the breast cancer dataset, also used in the work by Dag et al. (2019) and available in the mlbench package (Leisch and Dimitriadou, 2010). Before going further, we load the dataset and start processing the data.
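For concreteness, the loading and cleaning step might look as follows (a sketch: dropping the Id column, removing the rows with missing values, and converting the factor-coded predictors to numeric are our choices, not necessarily the original authors'):

```r
library(mlbench)

data("BreastCancer", package = "mlbench")

# Drop the Id column; keep the nine predictors and the Class label
bc <- BreastCancer[, -1]

# Remove rows with missing values (Bare.nuclei contains a few NAs)
bc <- na.omit(bc)

# The predictors are stored as factors; convert them to numeric scores
bc[, 1:9] <- lapply(bc[, 1:9], function(col) as.numeric(as.character(col)))

str(bc)
```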
The next step is defining the input and output variables and dividing the data into three sets: train (60%), validation (20%), and test (20%). For reproducibility of results, let's fix the seed number to 100. Then we obtain the number of observations in each fold.
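In code, this step could be sketched as follows (assuming the cleaned data sit in a data frame `bc` with the nine predictors in columns 1:9 and the class label in column 10; the variable names are our own):

```r
# Assumes `bc` is the cleaned dataset: predictors in columns 1:9,
# class label (benign / malignant) in column 10
x <- data.matrix(bc[, 1:9])   # input variables
y <- bc[, 10]                 # output variable

set.seed(100)                 # fix the seed for reproducibility

n      <- nrow(bc)
ntrain <- round(n * 0.60)     # observations in the train fold
nvalid <- round(n * 0.20)     # observations in the validation fold
ntest  <- n - ntrain - nvalid # the remainder goes to the test fold
```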
Now let's obtain the indices of the train, validation, and test sets. Before doing so, we shuffle the indices to prevent any bias based on the order of the observations. We then use the shuffled indices to construct the three sets.
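A sketch of the shuffling and subsetting (assuming `x`, `y`, and the fold sizes `ntrain` and `nvalid` defined as in the previous step; names are our own):

```r
# Shuffle the row indices to avoid any ordering bias, then cut them
# into consecutive train / validation / test chunks
idx      <- sample(1:n, n, replace = FALSE)
trainidx <- idx[1:ntrain]
valididx <- idx[(ntrain + 1):(ntrain + nvalid)]
testidx  <- idx[(ntrain + nvalid + 1):n]

x.train <- x[trainidx, ];  y.train <- y[trainidx]
x.valid <- x[valididx, ];  y.valid <- y[valididx]
x.test  <- x[testidx, ];   y.test  <- y[testidx]
```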
After obtaining the train, validation and test sets, we can apply the GMDH-type neural network algorithm, which is available in the GMDH2 package.
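The fitting call might be sketched as below; we assume GMDH2 exposes a `GMDH()` function that accepts the training set together with a validation set used for neuron selection (check `?GMDH2::GMDH` for the exact signature in your package version):

```r
library(GMDH2)

# Fit a GMDH-type neural network: the train set estimates the models in
# each layer, the validation set selects which neurons are carried forward
model <- GMDH(x.train, y.train, x.valid, y.valid)
```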
Now, let's obtain the performance measures on the test set.
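One way to compute these measures (assuming `predict()` works on the fitted object and that GMDH2's `confMat()` helper reports accuracy, sensitivity, and specificity; base R's `table()` is a fallback for the raw confusion matrix):

```r
# Predicted classes for the held-out test set
y.pred <- predict(model, x.test, type = "class")

# Confusion matrix with accuracy, sensitivity, specificity, etc.
confMat(y.pred, y.test)

# Fallback with base R only:
# table(Predicted = y.pred, Actual = y.test)
```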
Based on the above, the accuracy of the GMDH algorithm is estimated to be 0.9485; that is, the algorithm classifies 94.85% of persons into the correct class. Sensitivity and specificity are calculated as 0.8913 and 0.9778, respectively: the algorithm correctly identifies 89.13% of the persons having breast cancer and 97.78% of the persons not having it.