Principal Components example
The data for this example are in Table 9.12 of Johnson and Wichern. They consist of 3 measures of sales performance and 4 test scores for each of 50 salespeople at a large firm. The data begin:
In what follows I do PCA on both the sample variance-covariance matrix and on the sample correlation matrix. The S code is included as a tutorial for those interested. First, read in and print out the data.
> sales <- matrix(scan("T9-12.DAT"),ncol=7,byrow=T)
> sales
     [,1]  [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 93.0  96.0 97.8    9   12    9   20
[2,] 88.8  91.8 96.8    7   10   10   15
[3,] 95.0 100.3 99.0    8   12    9   26
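A quick sanity check (my addition, not in the original transcript): the matrix should have one row per salesperson and one column per variable.

> dim(sales)   # should print 50 7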
Compute the variance-covariance matrix and the correlation matrix and print them out.

> S <- var(sales)
> S
          [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
[1,]  53.83664  68.79409 30.564531 16.579673 17.585224 10.587592  71.69861
[2,]  68.79409 102.50175 40.195082 21.656286 25.561265 10.081306 100.74416
[3,]  30.56453  40.19508 22.205000 13.036531 10.167551  6.463673  42.33510
[4,]  16.57967  21.65629 13.036531 15.603673  7.898367  1.241633  17.17633
[5,]  17.58522  25.56127 10.167551  7.898367 11.456735  2.795102  20.49306
[6,]  10.58759  10.08131  6.463673  1.241633  2.795102  4.577959  12.76980
[7,]  71.69861 100.74416 42.335102 17.176327 20.493061 12.769796 111.04327
> D <- diag(sqrt(diag(S)))
> D
         [,1]     [,2]     [,3]     [,4]    [,5]     [,6]     [,7]
[1,] 7.337345  0.00000 0.000000 0.000000 0.00000 0.000000  0.00000
[2,] 0.000000 10.12431 0.000000 0.000000 0.00000 0.000000  0.00000
[3,] 0.000000  0.00000 4.712218 0.000000 0.00000 0.000000  0.00000
[4,] 0.000000  0.00000 0.000000 3.950149 0.00000 0.000000  0.00000
[5,] 0.000000  0.00000 0.000000 0.000000 3.38478 0.000000  0.00000
[6,] 0.000000  0.00000 0.000000 0.000000 0.00000 2.139617  0.00000
[7,] 0.000000  0.00000 0.000000 0.000000 0.00000 0.000000 10.53771
> R <- solve(D,S)%*% solve(D)
> R
          [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
[1,] 1.0000000 0.9260758 0.8840023 0.5720363 0.7080738 0.6744073 0.9273116
[2,] 0.9260758 1.0000000 0.8425232 0.5415080 0.7459097 0.4653880 0.9442960
[3,] 0.8840023 0.8425232 1.0000000 0.7003630 0.6374712 0.6410886 0.8525682
[4,] 0.5720363 0.5415080 0.7003630 1.0000000 0.5907360 0.1469074 0.4126395
[5,] 0.7080738 0.7459097 0.6374712 0.5907360 1.0000000 0.3859502 0.5745533
[6,] 0.6744073 0.4653880 0.6410886 0.1469074 0.3859502 1.0000000 0.5663721
[7,] 0.9273116 0.9442960 0.8525682 0.4126395 0.5745533 0.5663721 1.0000000
> round(R,2)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1.00 0.93 0.88 0.57 0.71 0.67 0.93
[2,] 0.93 1.00 0.84 0.54 0.75 0.47 0.94
[3,] 0.88 0.84 1.00 0.70 0.64 0.64 0.85
[4,] 0.57 0.54 0.70 1.00 0.59 0.15 0.41
[5,] 0.71 0.75 0.64 0.59 1.00 0.39 0.57
[6,] 0.67 0.47 0.64 0.15 0.39 1.00 0.57
[7,] 0.93 0.94 0.85 0.41 0.57 0.57 1.00

Look at the correlations among the three measures of sales performance (variables 1-3): all are quite high. Also very high are the correlations of the math test score (variable 7) with the three sales performance indices. I find it interesting that the correlations amongst the 4 test scores are not all that high.
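As a check on the solve(D,S) %*% solve(D) construction (my addition, not in the original transcript), the built-in cor function should reproduce the correlation matrix directly:

> round(cor(sales), 2)   # should agree with round(R,2) above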
Compute the eigenvalues and eigenvectors of S. Note that without the symmetric=T argument to eigen you might not get normalized eigenvectors.
> e.S <- eigen(S,symmetric=T)
> e.S
$values:
[1] 285.1366314  17.2678388   8.9785502   5.8957342   2.5247046   1.1505297
[7]   0.2710316

$vectors:
           [,1]        [,2]       [,3]        [,4]        [,5]        [,6]
[1,] 0.42055360 -0.12038402 -0.3171150 -0.47136100  0.59883830  0.05743392
[2,] 0.58891939 -0.06544701  0.5939649  0.07524351 -0.05949667  0.53028957
[3,] 0.25070446 -0.25606621 -0.4609316  0.16111410 -0.62885652  0.17742124
[4,] 0.12864131 -0.73754639 -0.1590590  0.41980231  0.25205680 -0.14283133
[5,] 0.14144240 -0.36483937  0.3609846 -0.52252896 -0.34535826 -0.56931521
[6,] 0.07253396  0.05320196 -0.3870509 -0.47051219 -0.24055838  0.30161519
[7,] 0.60962298  0.48553226 -0.1696108  0.27485861 -0.04146119 -0.49852152
            [,7]
[1,]  0.35212680
[2,] -0.07541783
[3,]  0.45411767
[4,] -0.39244052
[5,]  0.01215659
[6,] -0.68693393
[7,] -0.19509084
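A step worth adding here (my addition, not in the original transcript): each eigenvalue's share of the eigenvalue sum, which equals the trace of S, is the proportion of total variance carried by that component. From the eigenvalues above, the first component alone accounts for about 89% of the total variance.

> e.S$values/sum(e.S$values)            # proportion of variance per component
> cumsum(e.S$values)/sum(e.S$values)    # cumulative proportion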
Store the eigenvectors in P.S and check that P.S %*% t(P.S) is the identity, i.e. that the eigenvectors are orthonormal.

> P.S <- e.S$vectors
> P.S %*% t(P.S)
              [,1]          [,2]          [,3]          [,4]          [,5]
[1,]  1.000000e+00  3.989864e-16  8.326673e-17  3.885781e-16 -5.117434e-17
[2,]  3.989864e-16  1.000000e+00 -6.938894e-18  1.491862e-16  1.315137e-16
[3,]  8.326673e-17 -6.938894e-18  1.000000e+00  3.885781e-16 -1.830133e-16
[4,]  3.885781e-16  1.491862e-16  3.885781e-16  1.000000e+00  3.452100e-16
[5,] -5.117434e-17  1.315137e-16 -1.830133e-16  3.452100e-16  1.000000e+00
[6,] -9.159340e-16  3.053113e-16  0.000000e+00 -5.551115e-17 -3.035766e-16
[7,]  1.804112e-16 -1.006140e-16  2.775558e-17 -3.608225e-16 -1.517883e-17
              [,6]          [,7]
[1,] -9.159340e-16  1.804112e-16
[2,]  3.053113e-16 -1.006140e-16
[3,]  0.000000e+00  2.775558e-17
[4,] -5.551115e-17 -3.608225e-16
[5,] -3.035766e-16 -1.517883e-17
[6,]  1.000000e+00 -5.551115e-17
[7,] -5.551115e-17  1.000000e+00
> round(.Last.value,6)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    0    0    0    0    0    0
[2,]    0    1    0    0    0    0    0
[3,]    0    0    1    0    0    0    0
[4,]    0    0    0    1    0    0    0
[5,]    0    0    0    0    1    0    0
[6,]    0    0    0    0    0    1    0
[7,]    0    0    0    0    0    0    1
Put the eigenvalues in a diagonal matrix.

> L.S <- diag(e.S$values)
> L.S
         [,1]     [,2]    [,3]     [,4]     [,5]    [,6]      [,7]
[1,] 285.1366  0.00000 0.00000 0.000000 0.000000 0.00000 0.0000000
[2,]   0.0000 17.26784 0.00000 0.000000 0.000000 0.00000 0.0000000
[3,]   0.0000  0.00000 8.97855 0.000000 0.000000 0.00000 0.0000000
[4,]   0.0000  0.00000 0.00000 5.895734 0.000000 0.00000 0.0000000
[5,]   0.0000  0.00000 0.00000 0.000000 2.524705 0.00000 0.0000000
[6,]   0.0000  0.00000 0.00000 0.000000 0.000000 1.15053 0.0000000
[7,]   0.0000  0.00000 0.00000 0.000000 0.000000 0.00000 0.2710316
Check the spectral decomposition identity S = P.S %*% L.S %*% t(P.S); the difference should be zero up to rounding error.

> P.S%*% L.S %*% t(P.S) - S
              [,1]          [,2]          [,3]          [,4]          [,5]
[1,] -2.131628e-14 -2.842171e-14 -1.065814e-14  1.421085e-14 -3.552714e-15
[2,] -2.842171e-14 -9.947598e-14 -3.552714e-14  3.552714e-15 -1.421085e-14
[3,] -7.105427e-15 -2.842171e-14 -7.105427e-15  1.243450e-14 -5.329071e-15
[4,]  1.776357e-14  7.105427e-15  1.243450e-14  1.243450e-14  8.881784e-15
[5,] -3.552714e-15 -1.421085e-14 -3.552714e-15  8.881784e-15 -7.105427e-15
[6,] -8.526513e-14 -7.815970e-14 -4.263256e-14 -9.992007e-15 -2.753353e-14
[7,]  2.842171e-14 -2.842171e-14  0.000000e+00  3.552714e-15 -3.552714e-15
              [,6]          [,7]
[1,] -8.348877e-14  1.421085e-14
[2,] -7.815970e-14 -4.263256e-14
[3,] -4.263256e-14  0.000000e+00
[4,] -9.992007e-15  3.552714e-15
[5,] -2.664535e-14  0.000000e+00
[6,] -5.240253e-14 -3.907985e-14
[7,] -3.730349e-14  5.684342e-14
Watch what happens if you don't use symmetric=T.

> e.R <- eigen(R)
> e.R
$values:
[1] 5.03459779 0.93351614 0.49791975 0.42124549 0.08104043 0.02034063 0.01133977

$vectors:
          [,1]         [,2]         [,3]        [,4]         [,5]        [,6]
[1,] 0.7269274 -0.116207095 -0.056760689  0.03974777 -0.687192227 -0.28993184
[2,] 0.7043683  0.030454407 -0.332704938 -0.01008627  0.000128432  0.67646542
[3,] 0.7057720  0.009568613  0.153532259  0.30479417  0.761650850 -0.13507161
[4,] 0.4932870  0.695047750  0.339482201  0.28395450 -0.283579937  0.09834274
[5,] 0.5851533  0.306695959  0.004452648 -0.79414371  0.189333749 -0.16961040
[6,] 0.4847059 -0.667972463  0.453988739 -0.14415168 -0.094478767  0.20350664
[7,] 0.6828969 -0.208350975 -0.326359275  0.23076897  0.053870538 -0.31966116
            [,7]
[1,] -0.45877388
[2,] -0.08646865
[3,] -0.34694455
[4,]  0.26071817
[5,]  0.06285144
[6,]  0.19855797
[7,]  0.55299119
> P.R <- e.R$vectors
> P.R %*% t(P.R)
          [,1]      [,2]      [,3]       [,4]      [,5]       [,6]      [,7]
[1,] 1.3134963 0.3704216 0.1902634  0.3165822 0.2481377  0.3132995 0.3502879
[2,] 0.3704216 1.0729387 0.3819860  0.2967566 0.3078862  0.2919624 0.3168702
[3,] 0.1902634 0.3819860 1.3334041  0.1737410 0.3198625  0.1931292 0.3925566
[4,] 0.3165822 0.2967566 0.1737410  1.0803646 0.2238431 -0.0134117 0.2442469
[5,] 0.2481377 0.3078862 0.3198625  0.2238431 1.1357161  0.1553362 0.2501557
[6,] 0.3132995 0.2919624 0.1931292 -0.0134117 0.1553362  0.9977790 0.3284057
[7,] 0.3502879 0.3168702 0.3925566  0.2442469 0.2501557  0.3284057 1.0804075

This should have been the identity but isn't: without symmetric=T, eigen does not guarantee an orthonormal set of eigenvectors.
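A remark (my addition, not in the original transcript): since R is symmetric with distinct eigenvalues, the columns above are still mutually orthogonal eigenvectors; they just don't have unit length. Rescaling each column by its length would restore orthonormality:

> len <- sqrt(apply(P.R^2, 2, sum))   # length of each eigenvector
> P.Rn <- sweep(P.R, 2, len, "/")     # unit-length columns; P.Rn %*% t(P.Rn) is now the identity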
Now make a plot of the first principal component versus the second.

> Y.S <- sales %*% P.S
> postscript("prcmps_S.ps",horizontal=F)
> plot(Y.S[,1],Y.S[,2],xlab="First PC",ylab="Second PC",main="Plot of 1st vs 2nd PC\n using S")
> dev.off()
Generated postscript file "prcmps_S.ps".
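A side note (my addition): sales %*% P.S computes scores from uncentered data, so each column of scores is shifted by a constant (the mean vector dotted with the eigenvector) relative to the usual centered scores. The shape of the scatterplot is unaffected, but centered scores could be obtained as follows:

> xbar <- apply(sales, 2, mean)           # column means
> Y.S.c <- sweep(sales, 2, xbar) %*% P.S  # centered principal component scores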
Now try to interpret the eigenvectors.

> P.S
           [,1]        [,2]       [,3]        [,4]        [,5]        [,6]
[1,] 0.42055360 -0.12038402 -0.3171150 -0.47136100  0.59883830  0.05743392
[2,] 0.58891939 -0.06544701  0.5939649  0.07524351 -0.05949667  0.53028957
[3,] 0.25070446 -0.25606621 -0.4609316  0.16111410 -0.62885652  0.17742124
[4,] 0.12864131 -0.73754639 -0.1590590  0.41980231  0.25205680 -0.14283133
[5,] 0.14144240 -0.36483937  0.3609846 -0.52252896 -0.34535826 -0.56931521
[6,] 0.07253396  0.05320196 -0.3870509 -0.47051219 -0.24055838  0.30161519
[7,] 0.60962298  0.48553226 -0.1696108  0.27485861 -0.04146119 -0.49852152
            [,7]
[1,]  0.35212680
[2,] -0.07541783
[3,]  0.45411767
[4,] -0.39244052
[5,]  0.01215659
[6,] -0.68693393
[7,] -0.19509084

The leading principal component uses the first column of P.S as coefficients. It is hard to interpret, although all the weights are positive. The big weights fall on the variables with large variances, which is natural but uninformative. The trouble is that the 7 variables are not really on comparable scales. Now we do the whole thing again with R.
> e.R <- eigen(R,symmetric=T)
> e.R
$values:
[1] 5.03459779 0.93351614 0.49791975 0.42124549 0.08104043 0.02034063 0.01133977

$vectors:
          [,1]         [,2]         [,3]        [,4]          [,5]       [,6]
[1,] 0.4336719  0.111754422 -0.075488541  0.04237344 -0.6324942624 -0.3365963
[2,] 0.4202136 -0.029287495 -0.442478953 -0.01075255  0.0001182093  0.7853424
[3,] 0.4210510 -0.009201975  0.204189315  0.32492838  0.7010262539 -0.1568114
[4,] 0.2942863 -0.668415809  0.451492333  0.30271208 -0.2610080204  0.1141710
[5,] 0.3490920 -0.294944379  0.005921773 -0.84660356  0.1742634819 -0.1969091
[6,] 0.2891669  0.642377957  0.603779622 -0.15367411 -0.0869586057  0.2362610
[7,] 0.4074041  0.200367651 -0.434039576  0.24601320  0.0495826418 -0.3711105
            [,7]
[1,]  0.52782527
[2,]  0.09948330
[3,]  0.39916419
[4,] -0.29995962
[5,] -0.07231139
[6,] -0.22844351
[7,] -0.63622351
> P.R <- e.R$vectors
> L.R <- diag(e.R$values)
> L.R
         [,1]      [,2]      [,3]      [,4]       [,5]       [,6]       [,7]
[1,] 5.034598 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000
[2,] 0.000000 0.9335161 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000
[3,] 0.000000 0.0000000 0.4979197 0.0000000 0.00000000 0.00000000 0.00000000
[4,] 0.000000 0.0000000 0.0000000 0.4212455 0.00000000 0.00000000 0.00000000
[5,] 0.000000 0.0000000 0.0000000 0.0000000 0.08104043 0.00000000 0.00000000
[6,] 0.000000 0.0000000 0.0000000 0.0000000 0.00000000 0.02034063 0.00000000
[7,] 0.000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.01133977
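For the correlation matrix the eigenvalues sum to trace(R) = 7, the number of variables, so the variance proportions are simply the eigenvalues divided by 7 (my addition, not in the original transcript). From the eigenvalues above, the first component accounts for about 72% of the total standardized variance.

> e.R$values/7            # proportion of variance per component
> cumsum(e.R$values)/7    # cumulative proportion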
Now let's try to interpret the eigenvectors for the correlation matrix.

> P.R
          [,1]         [,2]         [,3]        [,4]          [,5]       [,6]
[1,] 0.4336719  0.111754422 -0.075488541  0.04237344 -0.6324942624 -0.3365963
[2,] 0.4202136 -0.029287495 -0.442478953 -0.01075255  0.0001182093  0.7853424
[3,] 0.4210510 -0.009201975  0.204189315  0.32492838  0.7010262539 -0.1568114
[4,] 0.2942863 -0.668415809  0.451492333  0.30271208 -0.2610080204  0.1141710
[5,] 0.3490920 -0.294944379  0.005921773 -0.84660356  0.1742634819 -0.1969091
[6,] 0.2891669  0.642377957  0.603779622 -0.15367411 -0.0869586057  0.2362610
[7,] 0.4074041  0.200367651 -0.434039576  0.24601320  0.0495826418 -0.3711105
            [,7]
[1,]  0.52782527
[2,]  0.09948330
[3,]  0.39916419
[4,] -0.29995962
[5,] -0.07231139
[6,] -0.22844351
[7,] -0.63622351

The entries in the first column are nearly constant, so the leading principal component is close to a simple average of the 7 variables, each expressed in standard units (that is, each variable centered at its mean and divided by its standard deviation).
Now transform the data to get the principal components. Then plot the first two principal components against each other and examine the plot, looking for interesting cases. The command that computes the scores Y.R did not survive in the transcript; a plausible reconstruction is sketched below.
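This reconstruction is my assumption, not the original command: to parallel Y.S <- sales %*% P.S above, the scores should come from the data in standard units, i.e. scaled by the standard deviations in D. The cutoffs used below (25 and 6) are consistent with uncentered scaled scores.

> Y.R <- sales %*% solve(D) %*% P.R   # scores of the scaled (but uncentered) data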
> postscript("prcmps_R.ps", horizontal = F) > plot(Y.R[, 1], Y.R[, 2], xlab = "First PC", ylab = "Second PC", main = + "Plot of 1st vs 2nd PC\n using R") > l1 <- (abs(Y.R[,1])>25) > l1 [1] F F F F F F F T F F F F F F F T F F F F F F F F F F F F F F F F F F F F F F [39] F F F F F T F F F T F F > l2 <- (abs(Y.R[,2]) > 6) > l2 [1] F F F F F F F F F F F F F F F F F F F F F T F F F F F F F F F T F F F T T F [39] F F F T F F F F F F F F > l3 <- (l1|l2) > l3 [1] F F F F F F F T F F F F F F F T F F F F F T F F F F F F F F F T F F F T T F [39] F F F T F T F F F T F F > text(Y.R[,1],Y.R[,2],labels=l3) > dev.off() Generated postscript file "prcmps_R.ps". null device 1Here is the plot.