
ML – DSC 3601
OCTOBER 11
20DSC216
M. VISHWA

MACHINE LEARNING

Machine learning is a field of study that enables computers to "learn" from
data without being explicitly programmed. It is a branch of artificial
intelligence (AI) and computer science that focuses on using data and algorithms
to imitate the way humans learn, gradually improving in accuracy.

Predictive Modelling
Predictive modelling is a probabilistic process that allows us to forecast
outcomes on the basis of a set of predictors. These predictors are features that
come into play when deciding the final result, i.e. the outcome of the model.
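
As a minimal sketch of this idea (not part of the original notes; the data values
and feature choices are made up purely for illustration), a predictive model can
be fit on a couple of predictors and then used to produce a probabilistic forecast
for a new observation:

# Illustrative sketch: fitting a simple predictive model on two predictors
# (synthetic data, chosen only for illustration)
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[25, 1], [32, 0], [47, 1], [51, 0], [62, 1]])  # predictors (features)
y = np.array([0, 0, 1, 1, 1])                                # observed outcomes

model = LogisticRegression().fit(X, y)
print(model.predict([[40, 1]]))        # forecast the outcome for a new observation
print(model.predict_proba([[40, 1]]))  # probabilistic forecast (class probabilities)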

Dimensionality Reduction
In machine learning classification problems, there are often too many
factors on the basis of which the final classification is made. These factors are
variables called features. The higher the number of features, the harder it
becomes to visualize the training set and then work on it. Often, many of these
features are correlated, and hence redundant. This is where dimensionality
reduction algorithms come into play. Dimensionality reduction is the process of
reducing the number of random variables under consideration by obtaining a set
of principal variables. It can be divided into feature selection and feature
extraction.
A dimensionality reduction technique can be defined as "a way of
converting a higher-dimensional dataset into a lower-dimensional dataset while
ensuring that it provides similar information." These techniques are widely
used in machine learning to obtain a better-fitting predictive model when solving
classification and regression problems. They are commonly used in fields that
deal with high-dimensional data, such as speech recognition, signal processing,
and bioinformatics. They can also be used for data visualization, noise reduction,
cluster analysis, etc.
Importance of Dimensionality Reduction in Machine
Learning & Predictive Modelling
An intuitive example of dimensionality reduction is a simple e-mail
classification problem, where we need to classify whether an e-mail is spam or
not. This can involve a large number of features, such as whether or not the
e-mail has a generic title, the content of the e-mail, whether the e-mail uses a
template, etc. However, some of these features may overlap. Similarly, a
classification problem that relies on both humidity and rainfall can often be
collapsed onto just one underlying feature, since the two are correlated to a high
degree. Hence, we can reduce the number of features in such problems. A 3-D
classification problem can be hard to visualize, whereas a 2-D one can be mapped
to a simple two-dimensional plane, and a 1-D problem to a simple line. The figure
below illustrates this concept, where a 3-D feature space is split into two 2-D
feature spaces, and later, if the features are found to be correlated, the number of
features can be reduced even further.
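
As a rough numerical sketch of the humidity/rainfall example above (not part of
the original notes; the synthetic values are generated purely for illustration),
PCA can collapse the two highly correlated features into a single underlying
feature:

# Sketch: collapsing two correlated features into one with PCA
# (hypothetical humidity/rainfall values, generated only for illustration)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
humidity = rng.uniform(0.2, 0.9, size=100)
rainfall = 150 * humidity + rng.normal(0, 5, size=100)  # strongly correlated with humidity
X = np.column_stack([humidity, rainfall])

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)       # each sample is now described by a single feature
print(X.shape, '->', X_1d.shape)  # (100, 2) -> (100, 1)
print('Variance explained by one component:', pca.explained_variance_ratio_[0])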

Components of Dimensionality Reduction


There are two components of dimensionality reduction:
Feature selection:
Feature selection is the process of selecting a subset of the relevant
features and leaving out the irrelevant features present in a dataset, in order to
build a model of high accuracy. In other words, it is a way of selecting the
optimal features from the input dataset. It usually involves three approaches
(a small sketch of the filter approach follows this list):
• Filter
• Wrapper
• Embedded
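
As a small sketch of the filter approach (not part of the original notes; the
scoring function and the value of k are arbitrary illustrative choices), a
univariate statistic can score each feature and keep only the top-scoring subset:

# Sketch: filter-based feature selection with a univariate statistic (chi-squared)
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)            # 64 pixel features per image
selector = SelectKBest(score_func=chi2, k=20)  # keep the 20 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(X.shape, '->', X_selected.shape)         # (1797, 64) -> (1797, 20)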
Feature extraction:
Feature extraction is the process of transforming a space containing many
dimensions into a space with fewer dimensions. This approach is useful when we
want to retain as much of the information as possible while using fewer resources
to process it. It reduces data in a high-dimensional space to a lower-dimensional
space, i.e. a space with a smaller number of dimensions.
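
As an illustrative sketch of feature extraction (not part of the original notes;
the number of components is an arbitrary choice), PCA transforms the original
feature space into a smaller set of derived features, each a combination of the
original ones:

# Sketch: feature extraction with PCA (new features are linear combinations of pixels)
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
pca = PCA(n_components=10)               # extract 10 derived features from 64 pixels
X_extracted = pca.fit_transform(X)
print(X.shape, '->', X_extracted.shape)  # (1797, 64) -> (1797, 10)
print('Total variance retained:', pca.explained_variance_ratio_.sum())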
Techniques of dimensionality reduction:
Commonly used techniques include principal component analysis (PCA) and
manifold-learning methods such as Isomap (isometric mapping), which is
discussed in detail below.

Advantages of Dimensionality Reduction


• It helps in data compression, and hence reduces the required storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.

Disadvantages of Dimensionality Reduction


• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is
sometimes undesirable.
• PCA fails in cases where the mean and covariance are not enough to define
the dataset.
• We may not know how many principal components to keep; in practice,
some rules of thumb are applied (a small sketch follows this list).
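
As a brief sketch of one such rule of thumb (not from the original notes; the 95%
threshold is an arbitrary but commonly used choice), scikit-learn's PCA can be
asked to keep just enough components to explain a given fraction of the variance:

# Sketch: choosing the number of principal components by cumulative explained variance
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X)
print('Components kept:', pca.n_components_)
print('Cumulative variance explained:', pca.explained_variance_ratio_.sum())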

IsoMap Embedding:
Isomap is a nonlinear dimensionality reduction method. It is one of several
widely used low-dimensional embedding methods. Isomap is used for computing
a quasi-isometric, low-dimensional embedding of a set of high-dimensional data
points. The algorithm provides a simple method for estimating the intrinsic
geometry of a data manifold based on a rough estimate of each data point’s
neighbors on the manifold. Isomap is highly efficient and generally applicable to
a broad range of data sources and dimensionalities. Isomap is a technique that
combines several different algorithms, enabling it to use a non-linear way to
reduce dimensions while preserving local structures.

High-level steps that Isomap performs (a small step-by-step sketch on toy data
follows this list):

• Use a KNN approach to find the k nearest neighbors of every data point.
Here, “k” is an arbitrary number of neighbors that you can specify within the
model hyperparameters.
• Once the neighbors are found, construct the neighborhood graph, where
points are connected to each other if they are each other’s neighbors. Data
points that are not neighbors remain unconnected.
• Compute the shortest path between each pair of data points (nodes).
Typically, either the Floyd–Warshall or Dijkstra’s algorithm is used for this
task. Note that this step is also commonly described as finding the geodesic
distance between points.
• Use multidimensional scaling (MDS) to compute the lower-dimensional
embedding. Given that the distances between each pair of points are known,
MDS places each object into an N-dimensional space (N is specified as a
hyperparameter) such that the between-point distances are preserved as
well as possible.
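
As a minimal step-by-step sketch of these stages (not part of the original notes;
the toy data, k, and target dimension are arbitrary, and scikit-learn's
SMACOF-based MDS is used here for simplicity, whereas Isomap proper uses
classical eigen-decomposition MDS), the pipeline can be assembled from
scikit-learn and SciPy building blocks:

# Sketch: the four Isomap stages assembled by hand on toy data
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS

rng = np.random.default_rng(42)
t = np.linspace(0, 3 * np.pi, 30)
X = np.column_stack([np.cos(t), np.sin(t), t]) + 0.05 * rng.normal(size=(30, 3))  # toy helix in 3D

# Steps 1-2: k-nearest-neighbor graph, edge weights = Euclidean distances
graph = kneighbors_graph(X, n_neighbors=10, mode='distance')

# Step 3: geodesic distances = shortest paths through the neighborhood graph (Dijkstra)
geodesic = shortest_path(graph, method='D', directed=False)

# Step 4: multidimensional scaling on the geodesic distance matrix gives the embedding
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
embedding = mds.fit_transform(geodesic)
print(embedding.shape)  # (30, 2)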

Isomap in Python to reduce the dimensions of data


Let’s now use Isomap to reduce the high dimensionality of the pictures in the
digits dataset bundled with scikit-learn (a small, MNIST-like collection of 8×8
handwritten digit images). This will enable us to see how the different digits
cluster together in a 3D space.
Setup:
We will use the following data and libraries:
• Scikit-learn library for
1) the handwritten digits data from sklearn’s datasets (load_digits);
2) performing Isometric Mapping (Isomap);
• Plotly and Matplotlib for data visualizations
• Pandas for data manipulation
Let’s import the libraries.

import pandas as pd # for data manipulation

# Visualization
import plotly.express as px # for data visualization
import matplotlib.pyplot as plt # for showing handwritten digits

# Sklearn
from sklearn.datasets import load_digits # for the handwritten digits data
from sklearn.manifold import Isomap # for Isomap reduction

# Next, we load the digits data.

digits = load_digits()

# Load arrays containing digit data (64 pixels per image) and their true labels
X, y = load_digits(return_X_y=True)

# Some stats
print('Shape of digit images: ', digits.images.shape)
print('Shape of X (training data): ', X.shape)
print('Shape of y (true labels): ', y.shape)

# Let's display the first 10 handwritten digits, so we have a better idea of
# what we are working with.
fig, axs = plt.subplots(2, 5, sharey=False, tight_layout=True, figsize=(12,6),
                        facecolor='white')
n=0
plt.gray()
for i in range(0,2):
    for j in range(0,5):
        axs[i,j].matshow(digits.images[n])
        axs[i,j].set(title=y[n])
        n=n+1
plt.show()

Isometric Mapping
We will now apply Isomap to reduce the number of dimensions for each record
in the X array from 64 to 3.

### Step 1 - Configure the Isomap function; note we use default hyperparameter
### values in this example
embed3 = Isomap(
    n_neighbors=5,               # default=5, the algorithm finds local structures based on the nearest neighbors
    n_components=3,              # number of dimensions
    eigen_solver='auto',         # {'auto', 'arpack', 'dense'}, default='auto'
    tol=0,                       # default=0, convergence tolerance passed to arpack or lobpcg; not used if eigen_solver == 'dense'
    max_iter=None,               # default=None, maximum number of iterations for the arpack solver; not used if eigen_solver == 'dense'
    path_method='auto',          # {'auto', 'FW', 'D'}, default='auto', method to use in finding the shortest path
    neighbors_algorithm='auto',  # {'auto', 'brute', 'kd_tree', 'ball_tree'}, default='auto'
    n_jobs=-1,                   # int or None, default=None, number of parallel jobs to run; -1 means using all processors
    metric='minkowski',          # string or callable, default='minkowski'
    p=2,                         # default=2, parameter for the Minkowski metric; p=1 gives manhattan_distance (l1), p=2 gives euclidean_distance (l2)
    metric_params=None           # default=None, additional keyword arguments for the metric function
)

### Step 2 - Fit the data and transform it, so we have 3 dimensions instead of 64
X_trans3 = embed3.fit_transform(X)

### Step 3 - Print shape to test


print('The new shape of X: ',X_trans3.shape)

# Finally, let's plot a 3D scatter plot to see what the data looks like after
# reducing dimensions down to 3.

# Create a 3D scatter plot


fig = px.scatter_3d(None,
                    x=X_trans3[:,0], y=X_trans3[:,1], z=X_trans3[:,2],
                    color=y.astype(str),
                    height=900, width=900)

# Update chart looks
fig.update_layout(
    # title_text="Scatter 3D Plot",
    showlegend=True,
    legend=dict(orientation="h", yanchor="top", y=0, xanchor="center", x=0.5),
    scene_camera=dict(up=dict(x=0, y=0, z=1),
                      center=dict(x=0, y=0, z=-0.2),
                      eye=dict(x=-1.5, y=1.5, z=0.5)),
    margin=dict(l=0, r=0, b=0, t=0),
    scene=dict(xaxis=dict(backgroundcolor='white',
                          color='black',
                          gridcolor='#f0f0f0',
                          title_font=dict(size=10),
                          tickfont=dict(size=10)),
               yaxis=dict(backgroundcolor='white',
                          color='black',
                          gridcolor='#f0f0f0',
                          title_font=dict(size=10),
                          tickfont=dict(size=10)),
               zaxis=dict(backgroundcolor='lightgrey',
                          color='black',
                          gridcolor='#f0f0f0',
                          title_font=dict(size=10),
                          tickfont=dict(size=10))))

# Update marker size
fig.update_traces(marker=dict(size=2))

fig.show()

Notebook Code:


As you can see, Isomap has done a wonderful job of reducing the dimensions from
64 to 3 while preserving non-linear relationships. This enabled us to visualize
the clusters of handwritten digits in a three-dimensional space.

Conclusions
Isomap is one of the best tools for dimensionality reduction, enabling us to
preserve non-linear relationships between data points.
