
Principal Components Analysis

Steven M. Holland

Department of Geology, University of Georgia, Athens, GA 30602-2501

May 2008

Introduction

Suppose we had measured two variables, length and width, and plotted them as shown below. Both variables have approximately the same variance and they are highly correlated with one another. We could pass a vector through the long axis of the cloud of points and a second vector at right angles to the first, with both vectors passing through the centroid of the data.

Once we have made these vectors, we could find the coordinates of all of the data points relative to these two perpendicular vectors and replot the data, as shown here (both of these figures are from Swan and Sandilands, 1995).

In this new reference frame, note that variance is greater along axis 1 than it is on axis 2. Also note that the spatial relationships of the points are unchanged; this process has merely rotated the data. Finally, note that our new vectors, or axes, are uncorrelated. By performing such a rotation, the new axes might have particular explanations. In this case, axis 1 could be regarded as a size measure, with samples on the left having both small length and width and samples on the right having large length and width. Axis 2 could be regarded as a measure of shape, with samples at any axis 1 position (that is, of a given size) having different length to width ratios. PC axes will generally not coincide exactly with any of the original variables. Although these relationships may seem obvious, when one is dealing with many variables, this process allows one to assess relationships among variables much more quickly.

For data sets with many variables, the variance of some axes may be great, whereas that of others may be so small that they can be ignored. This is known as reducing the dimensionality of a data set: one might start with thirty original variables but end with only two or three meaningful axes. The formal name for this approach of rotating data such that each successive axis displays a decreasing amount of variance is Principal Components Analysis, or PCA. PCA produces linear combinations of the original variables to generate the axes, also known as principal components, or PCs.
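The rotation described above can be sketched in R. This is a minimal illustration on hypothetical length/width data (the variable names and simulated values are purely illustrative):

```r
# hypothetical length/width data: similar variances, highly correlated
set.seed(1)
len <- rnorm(50, mean=20, sd=3)
wid <- 0.8*len + rnorm(50, sd=1)
shells <- data.frame(length=len, width=wid)

# rotate the cloud onto its principal axes (centered, covariance-based)
pca <- prcomp(shells, center=TRUE, scale.=FALSE)

var(pca$x[,1]) > var(pca$x[,2])   # TRUE: variance is greater along axis 1
cor(pca$x[,1], pca$x[,2])         # zero (up to rounding): new axes are uncorrelated
sum(diag(cov(shells)))            # total variance before rotation...
sum(pca$sdev^2)                   # ...equals total variance after: a pure rotation
```

Because the rotation preserves the spatial relationships of the points, no information is lost; it is only redistributed among the axes.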

Computation

Given a data matrix with p variables and n samples, the data are first centered on the means of each variable. This ensures that the cloud of data is centered on the origin of the principal components, but it does not affect the spatial relationships of the data nor the variances along the variables. The first principal component (Y1) is given by the linear combination of the variables X1, X2, ..., Xp

Y1 = a11X1 + a12X2 + ... + a1pXp

or, in matrix notation

Y1 = a1ᵀX

The first principal component is calculated such that it accounts for the greatest possible variance in the data set. Of course, one could make the variance of Y1 as large as possible by choosing large values for the weights a11, a12, ..., a1p. To prevent this, weights are calculated with the constraint that their sum of squares is 1.

a11² + a12² + ... + a1p² = 1

The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance.

This continues until a total of p principal components have been calculated, equal to the original number of variables. At this point, the sum of the variances of all of the principal components will equal the sum of the variances of all of the variables; that is, all of the original information has been explained or accounted for. Collectively, these transformations of the original variables to the principal components are written

Y = AX

Calculating these transformations or weights requires a computer for all but the smallest matrices. The rows of matrix A are called the eigenvectors of matrix Sx, the variance-covariance matrix of the original data. The elements of an eigenvector are the weights aij, also known as loadings. The elements in the diagonal of matrix Sy, the variance-covariance matrix of the principal components, are known as the eigenvalues. Eigenvalues are the variance explained by each principal component, and to repeat, are constrained to decrease monotonically from the first principal component to the last. These eigenvalues are commonly plotted on a scree plot to show the decreasing rate at which variance is explained by additional principal components.
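These relationships can be checked directly with eigen() on a small variance-covariance matrix. This sketch uses hypothetical data with three variables of unequal variance (note that eigen() returns the eigenvectors as the columns, not the rows, of its result):

```r
# hypothetical data: three variables with different variances
set.seed(2)
X <- data.frame(a=rnorm(100, sd=3), b=rnorm(100, sd=2), c=rnorm(100, sd=1))

Sx <- cov(X)       # variance-covariance matrix of the original data
e  <- eigen(Sx)

e$values                              # eigenvalues: PC variances, decreasing monotonically
e$vectors                             # eigenvectors (loadings), one column per PC
zapsmall(t(e$vectors) %*% e$vectors)  # eigenvectors are orthonormal: identity matrix
all.equal(sum(e$values), sum(diag(Sx)))  # sum of eigenvalues = total variance
```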

[Figure: scree plot of log(variance) against principal component]

The positions of each observation in this new coordinate system of principal components are called scores and are calculated as linear combinations of the original variables and the weights aij. For example, the score for the rth sample on the kth principal component is calculated as

Yrk = ak1xr1 + ak2xr2 + ... + akpxrp
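As a check, a single score can be reproduced by hand from the centered data and the loadings. This sketch assumes hypothetical data; prcomp() stores the loadings in $rotation and the scores in $x:

```r
# hypothetical data; prcomp() stores loadings in $rotation, scores in $x
set.seed(3)
mydata <- as.data.frame(matrix(rnorm(20*4), ncol=4))
pca <- prcomp(mydata, center=TRUE, scale.=FALSE)
Xc  <- scale(mydata, center=TRUE, scale=FALSE)   # centered data, as in the text

r <- 5; k <- 2                                   # 5th sample, 2nd principal component
score.byhand <- sum(pca$rotation[,k] * Xc[r,])   # sum of weight * centered variable
all.equal(unname(score.byhand), unname(pca$x[r,k]))   # TRUE: matches prcomp's score
```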

In interpreting the principal components, it is often useful to know the correlations of the original variables with the principal components. The correlation of variable Xi and principal component Yj is

rij = aij √(Var(Yj) / sii)

where sii is the variance of variable Xi.
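This correlation can be verified numerically. The sketch below uses hypothetical data and assumes the relationship rij = aij·sqrt(Var(Yj)/sii), where sii is the variance of variable Xi, comparing it against cor() computed directly:

```r
# hypothetical data (covariance-based PCA, so sii is the variance of Xi)
set.seed(4)
mydata <- as.data.frame(matrix(rnorm(30*3), ncol=3))
pca <- prcomp(mydata, center=TRUE, scale.=FALSE)

i <- 1; j <- 2
aij  <- pca$rotation[i, j]    # loading of variable i on component j
varY <- pca$sdev[j]^2         # Var(Yj): the jth eigenvalue
sii  <- var(mydata[[i]])      # variance of variable Xi
rij  <- aij * sqrt(varY / sii)

all.equal(unname(rij), cor(mydata[[i]], pca$x[,j]))   # TRUE: matches cor() directly
```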

Because reduction of dimensionality, that is, focussing on a few principal components rather than many variables, is a goal of principal components analysis, several criteria have been proposed for determining how many PCs should be investigated and how many should be ignored. One common criterion is to ignore principal components at the point at which the next PC offers little increase in the total variance explained. A second criterion is to include all those PCs up to a predetermined total percent variance explained, such as 90%. A third criterion is to ignore components whose variance explained is less than 1 when a correlation matrix is used, or less than the average variance explained when a covariance matrix is used, the idea being that such a PC offers less than one variable's worth of information. A fourth criterion is to ignore the last PCs whose variance explained is all roughly equal.

Principal components are equivalent to major axis regressions. As such, principal components analysis is subject to the same restrictions as regression, in particular multivariate normality. The distributions of each variable should be checked for normality, and transforms used where necessary to correct high degrees of skewness in particular. Outliers should be removed from the data set as they can dominate the results of a principal components analysis.

PCA in R

1) Do an R-mode PCA using prcomp() in R. To do a Q-mode PCA, the data set should be transposed before proceeding. R-mode PCA examines the correlations or covariances among variables, whereas Q-mode focusses on the correlations or covariances among samples.

> mydata <- read.table(file="mydata.txt", header=TRUE, row.names=1, sep=",")
> mydata.pca <- prcomp(mydata, retx=TRUE, center=TRUE, scale.=TRUE)
  # variable means set to zero, and variances set to one
  # sample scores stored in mydata.pca$x
  # loadings stored in mydata.pca$rotation
  # singular values (square roots of eigenvalues) stored in mydata.pca$sdev
  # variable means stored in mydata.pca$center
  # variable standard deviations stored in mydata.pca$scale
> sd <- mydata.pca$sdev
> loadings <- mydata.pca$rotation
> rownames(loadings) <- colnames(mydata)
> scores <- mydata.pca$x

2) Do a PCA longhand in R:

> R <- cor(mydata)
  # calculate a correlation matrix
> myEig <- eigen(R)
  # find the eigenvalues and eigenvectors of correlation matrix
  # eigenvalues stored in myEig$values
  # eigenvectors (loadings) stored in myEig$vectors

> sdLONG <- sqrt(myEig$values)
  # calculating singular values from eigenvalues
> loadingsLONG <- myEig$vectors

> rownames(loadingsLONG) <- colnames(mydata)
  # saving as loadings, and setting rownames
> standardize <- function(x) {(x - mean(x))/sd(x)}
> X <- apply(mydata, MARGIN=2, FUN=standardize)
  # transforming data to zero mean and unit variance
> scoresLONG <- X %*% loadingsLONG
  # calculating scores from eigenanalysis results

3) Compare results from the two analyses to demonstrate equivalency. Maximum differences should be close to zero if the two approaches are equivalent.

> range(sd - sdLONG)
> range(loadings - loadingsLONG)
> range(scores - scoresLONG)

4) Do a distance biplot (see Legendre & Legendre, 1998, p. 403)

> quartz(height=7, width=7)
> plot(scores[,1], scores[,2], xlab="PCA 1", ylab="PCA 2", type="n",
    xlim=c(min(scores[,1:2]), max(scores[,1:2])),
    ylim=c(min(scores[,1:2]), max(scores[,1:2])))
> arrows(0, 0, loadings[,1]*10, loadings[,2]*10, length=0.1, angle=20, col="red")
  # note that this scaling factor of 10 may need to be changed,
  # depending on the data set
> text(loadings[,1]*10*1.2, loadings[,2]*10*1.2, rownames(loadings), col="red", cex=0.7)
  # 1.2 scaling ensures that labels are plotted just beyond the arrows
> text(scores[,1], scores[,2], rownames(scores), col="blue", cex=0.7)

[Figure: distance biplot, with sample scores (blue) and variable loadings (red arrows) plotted on PCA 1 and PCA 2]

> quartz(height=7, width=7)
> biplot(scores[,1:2], loadings[,1:2], xlab=rownames(scores),
    ylab=rownames(loadings), cex=0.7)
  # using built-in biplot function

[Figure: distance biplot drawn with the built-in biplot() function, with scores and loadings on PC1 and PC2]

5) Do a correlation biplot (see Legendre & Legendre, 1998, p. 404)

> quartz(height=7, width=7)
> plot(scores[,1]/sd[1], scores[,2]/sd[2], xlab="PCA 1", ylab="PCA 2", type="n")
> arrows(0, 0, loadings[,1]*sd[1], loadings[,2]*sd[2], length=0.1, angle=20, col="red")
> text(loadings[,1]*sd[1]*1.2, loadings[,2]*sd[2]*1.2, rownames(loadings), col="red", cex=0.7)
  # 1.2 scaling ensures that labels are plotted just beyond the arrows
> text(scores[,1]/sd[1], scores[,2]/sd[2], rownames(scores), col="blue", cex=0.7)

[Figure: correlation biplot, with sample scores (blue) and variable loadings (red arrows) plotted on PCA 1 and PCA 2]

> quartz(height=7, width=7)
> biplot(mydata.pca)
  # using built-in biplot function of prcomp(); note that only the
  # top and right axes match the coordinates of the points; also note
  # that it still doesn't quite replicate the correlation biplot.
  # It's unclear what this function really does.

[Figure: biplot produced by biplot(mydata.pca), with scores and loadings on PC1 and PC2]

6) Calculate the correlation coefficients between variables and principal components

> correlations <- t(loadings)*sd
  # find the correlation of all the variables with all PCs
> correlations <- cor(scores, mydata)
  # another way to find these correlations

7) Plot a scree graph

> quartz(height=7, width=7)
> plot(mydata.pca)
  # using built-in function for prcomp; may not show all PCs

[Figure: scree plot of variances produced by plot(mydata.pca)]

> quartz(height=7, width=7)
> plot(log(sd^2), xlab="principal component", ylab="log(variance)", type="b", pch=16)
  # using a general plot, with variance on a log scale

[Figure: scree plot of log(variance) against principal component]

8) Find variance along each principal component and the eigenvalues

> newsd <- apply(scores, MARGIN=2, FUN=sd)
  # take the standard deviation of each column of scores; note that
  # sd() no longer accepts a whole matrix in current versions of R
> max(sd - newsd)
  # finds maximum difference between the standard deviation from
  # prcomp and the standard deviation calculated longhand;
  # should be close to zero
> eigenvalues <- sd^2
> sum(eigenvalues)   # should equal number of variables
> length(mydata)     # number of variables

9) Save loadings, scores, and singular values to files

> write.table(loadings, file="loadings.txt")
> write.table(scores, file="scores.txt")
> write.table(sd, file="sd.txt")

References

Legendre, P., and L. Legendre, 1998. Numerical Ecology. Elsevier: Amsterdam, 853 p.

Swan, A.R.H., and M. Sandilands, 1995. Introduction to Geological Data Analysis. Blackwell Science: Oxford, 446 p.

