UNIT-4: 1. Variable Reduction 2. Principal Component Analysis
N.MAHESWARI (10UCS29)
C.MALARVIZHI (10UCS30)
VARIABLE REDUCTION
Principal component analysis is a variable-reduction procedure.
It is useful when we have obtained data on a large number of variables and believe that there is some redundancy in those variables.
In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of principal components (artificial variables) that account for most of the variance in the observed variables.
By information we mean the variation present in the sample, given by the correlations between the original variables.
The new variables, called principal components (PCs), are uncorrelated, and are ordered by the amount of the total information each retains.
Data preprocessing is an important step in effective machine learning and data mining.
Dimensionality reduction is an effective approach to downsizing data.
Why PCA?
Most machine learning and data mining techniques may not be effective for high-dimensional data (the curse of dimensionality): classification accuracy and efficiency degrade rapidly as the dimension increases. The intrinsic dimension, however, may be small. For example, the number of genes responsible for a certain type of disease may be small.
Visualization: projecting high-dimensional data onto two or three components also makes it possible to plot and inspect the data.
PRINCIPAL COMPONENT ANALYSIS
Reduces the number of predictors by finding weighted linear combinations of the predictors that retain most of the variance in the data set. These combinations are called principal components. PCA works only with continuous variables.
DIMENSIONALITY REDUCTION
A prerequisite for dimensionality reduction is understanding the data, using, for example, data summaries (min, max, mean, median, stdev) and visualization.
Domain knowledge should always be applied first to remove predictors known to be inapplicable (e.g. height for predicting client income). Data-driven techniques include correlation analysis, principal component analysis, and binning.
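As a minimal sketch of the data summaries mentioned above, the per-predictor statistics can be computed with NumPy (the data here is synthetic and purely illustrative):

```python
import numpy as np

# Illustrative data: 100 samples of 3 hypothetical predictors
# with different locations and scales.
rng = np.random.default_rng(42)
data = rng.normal(loc=[0.0, 5.0, -2.0], scale=[1.0, 2.0, 0.5], size=(100, 3))

# Per-column summaries used to get a first feel for the data.
summary = {
    "min": data.min(axis=0),
    "max": data.max(axis=0),
    "mean": data.mean(axis=0),
    "median": np.median(data, axis=0),
    "stdev": data.std(axis=0, ddof=1),   # sample standard deviation
}
for name, values in summary.items():
    print(name, np.round(values, 2))
```

Comparing means and standard deviations across predictors also shows when standardization is needed before PCA, since PCA is sensitive to scale.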
CORRELATION ANALYSIS
With many variables there is usually overlap in the information they carry. A simple technique for finding redundancies is to look at the correlation coefficients in a correlation matrix. Pairs with a very strong positive or negative correlation carry largely overlapping information, so one variable of each such pair is a candidate for removal.
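A minimal sketch of this redundancy screen in NumPy; the 0.9 threshold is an illustrative choice, not a fixed rule:

```python
import numpy as np

def correlated_pairs(X, threshold=0.9):
    """Return index pairs of columns whose |correlation| exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)   # correlation matrix of the columns
    pairs = []
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[1]):
            if abs(corr[i, j]) > threshold:
                pairs.append((i, j))
    return pairs

# Example: column 1 is almost a copy of column 0; column 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = np.column_stack([a, a + 0.01 * rng.normal(size=100),
                     rng.normal(size=100)])
print(correlated_pairs(X))   # only the (0, 1) pair is flagged as redundant
```

One of each flagged pair can then be dropped, keeping the variable that is easier to interpret or collect.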
PCA EXAMPLE
Applications of dimensionality reduction: PCA is used in several scientific fields, such as psychometrics, telecommunications, electroencephalography, stock-market analysis, and others.
Further application areas include customer relationship management, text mining, image retrieval, microarray data analysis, protein classification, face recognition, handwritten digit recognition, and intrusion detection.
Feature Selection
Definition: a process that chooses an optimal subset of features according to an objective function. Objectives: to reduce dimensionality and remove noise.
Feature extraction: feature reduction refers to mapping the original high-dimensional data onto a lower-dimensional space. Given a set of n data points in p dimensions {x1, x2, ..., xn}, the criterion for feature reduction can differ depending on the problem setting.
Features:
It is computationally inexpensive.
ALGORITHM:
The PCA algorithm consists of 5 main steps:
Subtract the mean: subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. This produces a data set whose mean is zero.
Calculate the covariance matrix: compute C, a p x p matrix in which each entry C(i, j) is the covariance between dimensions i and j:
cov(X, Y) = sum_k (X_k - mean(X)) * (Y_k - mean(Y)) / (n - 1)
Calculate the eigenvectors and eigenvalues of the covariance matrix: because C is symmetric, its eigenvectors are orthogonal and give the directions along which the data varies most.
Choose components and form a feature vector: once the eigenvectors of the covariance matrix are found, order them by eigenvalue, highest to lowest, so that the components are sorted in order of significance. The number of eigenvectors you choose will be the number of dimensions of the new data set. The objective of this step is to construct a feature vector (a matrix of vectors): from the ordered list, take the selected eigenvectors and form a matrix with them in the columns: Feature Vector = (eig_1, eig_2, ..., eig_n)
Derive the new data set. Take the transpose of the Feature Vector and multiply it on the left of the original data set, transposed: Final Data = RowFeatureVector x RowDataAdjusted where RowFeatureVector is the matrix with the eigenvectors in the columns transposed (the eigenvectors are now in the rows and the most significant are in the top) and RowDataAdjusted is the mean-adjusted data transposed (the data items are in each column, with each row holding a separate dimension).
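The five steps above can be sketched in NumPy as follows (a minimal illustration; the function and variable names are ours, and the example data is synthetic):

```python
import numpy as np

def pca(data, n_components):
    """PCA following the five steps above. data: (n_samples, p) array."""
    # Step 1: subtract the mean of each dimension.
    mean = data.mean(axis=0)
    adjusted = data - mean
    # Step 2: calculate the covariance matrix (variables in columns).
    cov = np.cov(adjusted, rowvar=False)
    # Step 3: eigenvectors and eigenvalues of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: order by eigenvalue, highest first, and keep the top
    # n_components eigenvectors as the columns of the feature vector.
    order = np.argsort(eigvals)[::-1]
    feature_vector = eigvecs[:, order[:n_components]]
    # Step 5: Final Data = RowFeatureVector x RowDataAdjusted.
    final = feature_vector.T @ adjusted.T
    return final.T, feature_vector   # samples back in rows for convenience

# Example: 2-D data lying almost on a line reduces well to 1 component.
rng = np.random.default_rng(1)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200)])
reduced, fv = pca(X, n_components=1)
print(reduced.shape)   # (200, 1)
```

Multiplying the reduced data by the transpose of the feature vector and adding the mean back approximately reconstructs the original data, which is one way to check how much information the kept components retain.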