
Unit 3

Define PCA. How does PCA maximise the variance?


PCA can be defined as the orthogonal projection of the data onto a lower-dimensional
linear space, known as the principal subspace, such that the variance of the projected data
is maximised.
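A minimal sketch of the maximisation, assuming the standard notation (a unit projection vector u_1 and the data covariance matrix S, neither of which is defined in the notes above):

\[
S = \frac{1}{N}\sum_{n=1}^{N}(x_n - \bar{x})(x_n - \bar{x})^{T},
\qquad
\text{projected variance} = u_1^{T} S u_1 .
\]

Maximising \(u_1^{T} S u_1\) subject to \(u_1^{T} u_1 = 1\) with a Lagrange multiplier \(\lambda_1\) gives

\[
S u_1 = \lambda_1 u_1 ,
\qquad
u_1^{T} S u_1 = \lambda_1 ,
\]

so the projected variance is largest when u_1 is the eigenvector of S with the largest eigenvalue (the first principal component).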

Example of PCA for Data Compression:


Consider the example of the offline digits dataset, which contains a large number of handwritten
digit images.

Aim: Reducing the number of variables in a dataset while retaining as much of the information as
possible.

In the case of image data, PCA can be used to identify the most important features in the images
and compress them into a smaller set of variables.

(1) Preprocess the images by flattening them into a vector of pixel values.

(2) Then, you would use PCA to identify the most important principal components of the image data.

(3) Finally, you would reconstruct the images using only the most important principal components, resulting in a compressed version of the dataset with lower dimensionality.

Consider the original dimension given as D. We reduce the data to a lower dimension M (M < D).

The larger the value of M, the more accurate the reconstructed image.

The smaller the value of M, the greater the degree of compression.
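A minimal sketch of this compression on the scikit-learn digits dataset; the choice of M = 16 and all variable names are illustrative assumptions, not part of the notes above:

# Hedged sketch: compressing the scikit-learn digits dataset with PCA.
# M = 16 components is an illustrative assumption.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()            # 1797 images of 8x8 pixels
X = digits.data                   # already flattened: shape (1797, 64), so D = 64

M = 16                            # keep only M principal components (M < D)
pca = PCA(n_components=M)
X_compressed = pca.fit_transform(X)                     # shape (1797, 16)
X_reconstructed = pca.inverse_transform(X_compressed)   # back to shape (1797, 64)

# fraction of the total variance retained by the M components
print(pca.explained_variance_ratio_.sum())

Increasing M retains more variance (a more accurate image); decreasing M gives stronger compression, as stated above.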


Explain steps involved in PCA with an example:
Step 1: Subtract the mean from each of the dimensions.

Subtracting the mean makes the variance and covariance calculations easier by simplifying their equations.

Step 2: Calculate the covariance matrix. It is a symmetric matrix.

Step 3: Calculate the eigen vectors V and eigen values D of the covariance matrix.

 Eigenvectors are plotted as diagonal dotted lines on the plot (note: they are perpendicular to each other).

 One of the eigenvectors goes through the middle of the points, like drawing a line of best fit.

 The second eigenvector gives us the other, less important, pattern in the data.

Step 4: Reduce dimensionality and form feature vector.

The eigenvector with the highest eigenvalue is the principal component of the data set.

In our example, the eigenvector with the largest eigenvalue is the one that points down the middle of the data.

Step 5: Derive the new data.
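The five steps can be sketched with NumPy as follows; the small 2-D toy data set and the choice M = 1 are illustrative assumptions, not data from the notes:

# Hedged sketch of the five PCA steps; X is an illustrative toy data set.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
              [1.5, 1.6], [1.1, 0.9]])          # N samples, D = 2 dimensions

# Step 1: subtract the mean from each dimension
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix (D x D, symmetric)
C = np.cov(X_centered, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)            # eigh: C is symmetric

# Step 4: keep the eigenvector(s) with the largest eigenvalue(s) -> feature vector
order = np.argsort(eigvals)[::-1]
feature_vector = eigvecs[:, order[:1]]          # principal component (M = 1)

# Step 5: derive the new data by projecting onto the feature vector
new_data = X_centered @ feature_vector          # shape (N, 1)
print(new_data)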

Feature Selection and its types:


Feature selection is a technique used in machine learning and data mining to identify and
select the most relevant features.

Need for Feature Selection:

1. To improve performance (in terms of speed, simplicity of the model).

2. To visualize the data for model selection.

3. To reduce dimensionality and remove noise.

4. To reduce overfitting.

 Features of FS:

1. Removing irrelevant data.


2. Increasing predictive accuracy of learned models.

3. Reducing the cost of the data.

4. Improving learning efficiency, such as reducing storage requirements and computational cost.

 The selection can be represented as a binary array, with each element set to 1 if the corresponding feature is currently selected by the algorithm and 0 if it is not.

 There are a total of 2^M subsets, where M is the number of features of a data set.

Here M = 3, so there are 2^3 = 8 possible subsets (a sketch of this representation follows below).
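A minimal sketch of the binary-array representation for M = 3; the feature names are illustrative assumptions:

# Hedged sketch: each feature subset is a binary array of length M.
from itertools import product

features = ["f1", "f2", "f3"]                 # M = 3 (names are illustrative)
M = len(features)

subsets = list(product([0, 1], repeat=M))     # all 2**M = 8 binary arrays
for mask in subsets:
    selected = [f for f, bit in zip(features, mask) if bit == 1]
    print(mask, selected)

print(len(subsets))                           # 8 = 2**3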

Types:
1. Filter Method
2. Wrapper Method
(1) Filter Method:
 The selection of features is independent of any machine learning algorithms.
 Features are selected on the basis of their scores in various statistical tests.

 Correlation (of each feature with the target) is the main criterion used here.

 Filter methods do not remove multicollinearity.
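A minimal sketch of a filter method using scikit-learn; the Iris data, the ANOVA F-test score, and k = 2 are illustrative assumptions, not part of the notes:

# Hedged sketch: score each feature with a univariate statistical test,
# independently of any learning algorithm (the defining property of filter methods).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)      # keeps the 2 best-scoring features

print(selector.scores_)                        # per-feature test scores
print(selector.get_support())                  # binary mask of selected features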

(2) Wrapper Method:


 (1) A subset of features is created.
 (2) Based on the inferences that we draw from the previous model, we decide to add or
remove features from the subset.

 These methods are usually computationally very expensive.

Wrapper methods include:

1. Forward Selection:
We start with no features in the model.
In each iteration, we add the feature that best improves our model (a sketch of this follows the list below).
2. Backward Elimination:
We start with all the features.
Remove the least significant feature at each iteration.
3. Bidirectional Generation:
We perform both Forward Selection and Backward Elimination concurrently.
4. Random Generation:
It starts the search in a random direction.
The choice of adding or removing a feature is a random decision.
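A minimal sketch of forward selection as a wrapper method, assuming scikit-learn's SequentialFeatureSelector; the logistic-regression estimator and n_features_to_select = 2 are illustrative assumptions (the notes do not name a library or model):

# Hedged sketch: a model is retrained on candidate subsets and the feature
# that most improves cross-validated performance is added at each iteration.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

estimator = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)

print(sfs.get_support())          # binary mask of the selected feature subset

Passing direction="backward" instead would give backward elimination with the same interface.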

              Filter           Wrapper
Method        Correlation      Subset construction, measuring model performance
Time          Fast             Slow
Cost          Cheap            Expensive
Result        Might fail       Always provides the best subset of features
Overfitting   Never            May be prone to it

Derive Fisher's Linear Discriminant using an example:


PCA finds the most accurate data representation. However, the direction of maximum
variance may be useless for classification.
So FLD is used, which projects the data onto a line such that samples from different classes are well
separated.
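A minimal sketch of the two-class derivation, assuming the standard notation (class means m_1, m_2, within-class scatter S_W, between-class scatter S_B; these symbols are not introduced in the notes above):

FLD projects each sample x onto a line, y = w^T x, and chooses w to maximise the Fisher criterion

\[
J(w) = \frac{w^{T} S_B w}{w^{T} S_W w},
\qquad
S_B = (m_2 - m_1)(m_2 - m_1)^{T},
\qquad
S_W = \sum_{k}\sum_{n \in C_k}(x_n - m_k)(x_n - m_k)^{T}.
\]

Setting the derivative of J(w) to zero gives the generalised eigenvalue problem \(S_B w = \lambda S_W w\); since \(S_B w\) always points along \((m_2 - m_1)\), the solution is

\[
w \propto S_W^{-1}(m_2 - m_1),
\]

so the projected class means are pushed as far apart as possible relative to the within-class spread, which is exactly the "well separated" projection described above.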
Unit 4
