Principal Components example

The data for this example are in Table 9.12 in Johnson and Wichern. They consist of 3 measurements of sales performance and 4 test scores for each of 50 salespeople at a large firm. The first few rows of the data are shown in the printout below.

In what follows I do PCA on both the sample variance-covariance matrix and on the sample correlation matrix. The S code is included as a tutorial for those interested. Read in and print out the data.

```> sales <- matrix(scan("T9-12.DAT"),ncol=7,byrow=T)
> sales
[,1]  [,2]  [,3] [,4] [,5] [,6] [,7]
[1,]  93.0  96.0  97.8    9   12    9   20
[2,]  88.8  91.8  96.8    7   10   10   15
[3,]  95.0 100.3  99.0    8   12    9   26```
Compute the variance-covariance matrix S and the correlation matrix R and print them out.
```> S <- var(sales)
> S
[,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
[1,] 53.83664  68.79409 30.564531 16.579673 17.585224 10.587592  71.69861
[2,] 68.79409 102.50175 40.195082 21.656286 25.561265 10.081306 100.74416
[3,] 30.56453  40.19508 22.205000 13.036531 10.167551  6.463673  42.33510
[4,] 16.57967  21.65629 13.036531 15.603673  7.898367  1.241633  17.17633
[5,] 17.58522  25.56127 10.167551  7.898367 11.456735  2.795102  20.49306
[6,] 10.58759  10.08131  6.463673  1.241633  2.795102  4.577959  12.76980
[7,] 71.69861 100.74416 42.335102 17.176327 20.493061 12.769796 111.04327
> D <- diag(sqrt(diag(S)))
> D
[,1]     [,2]     [,3]     [,4]    [,5]     [,6]     [,7]
[1,] 7.337345  0.00000 0.000000 0.000000 0.00000 0.000000  0.00000
[2,] 0.000000 10.12431 0.000000 0.000000 0.00000 0.000000  0.00000
[3,] 0.000000  0.00000 4.712218 0.000000 0.00000 0.000000  0.00000
[4,] 0.000000  0.00000 0.000000 3.950149 0.00000 0.000000  0.00000
[5,] 0.000000  0.00000 0.000000 0.000000 3.38478 0.000000  0.00000
[6,] 0.000000  0.00000 0.000000 0.000000 0.00000 2.139617  0.00000
[7,] 0.000000  0.00000 0.000000 0.000000 0.00000 0.000000 10.53771
> R <- solve(D,S)%*% solve(D)
> R
[,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
[1,] 1.0000000 0.9260758 0.8840023 0.5720363 0.7080738 0.6744073 0.9273116
[2,] 0.9260758 1.0000000 0.8425232 0.5415080 0.7459097 0.4653880 0.9442960
[3,] 0.8840023 0.8425232 1.0000000 0.7003630 0.6374712 0.6410886 0.8525682
[4,] 0.5720363 0.5415080 0.7003630 1.0000000 0.5907360 0.1469074 0.4126395
[5,] 0.7080738 0.7459097 0.6374712 0.5907360 1.0000000 0.3859502 0.5745533
[6,] 0.6744073 0.4653880 0.6410886 0.1469074 0.3859502 1.0000000 0.5663721
[7,] 0.9273116 0.9442960 0.8525682 0.4126395 0.5745533 0.5663721 1.0000000
> round(R,2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1.00 0.93 0.88 0.57 0.71 0.67 0.93
[2,] 0.93 1.00 0.84 0.54 0.75 0.47 0.94
[3,] 0.88 0.84 1.00 0.70 0.64 0.64 0.85
[4,] 0.57 0.54 0.70 1.00 0.59 0.15 0.41
[5,] 0.71 0.75 0.64 0.59 1.00 0.39 0.57
[6,] 0.67 0.47 0.64 0.15 0.39 1.00 0.57
[7,] 0.93 0.94 0.85 0.41 0.57 0.57 1.00```
Look at the correlations among the three measures of sales performance: all are quite high. Also very high is the correlation of the mathematics score with the three sales performance indices. I find it interesting that the correlations amongst the 4 test scores are not all that high.
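As an aside, the D⁻¹SD⁻¹ construction above is exactly what a built-in correlation function computes, and in R (or a modern S dialect) one can check the equivalence directly. A minimal sketch, using a small synthetic matrix since T9-12.DAT is not reproduced here:

```r
# Sketch: verify that D^{-1} S D^{-1} agrees with the built-in cor(),
# on a synthetic data matrix standing in for the sales data.
set.seed(1)
X <- matrix(rnorm(40), ncol = 4)
S <- var(X)
D <- diag(sqrt(diag(S)))          # diagonal matrix of standard deviations
R1 <- solve(D, S) %*% solve(D)    # the construction used in the text
R2 <- cor(X)                      # built-in correlation matrix
max(abs(R1 - R2))                 # zero up to rounding error
```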

Compute the eigenvalues and eigenvectors of S. Note that without the symmetric=T argument to eigen you might not get normalized eigenvectors.

```> e.S <- eigen(S,symmetric=T)
> e.S
$values:
[1] 285.1366314  17.2678388   8.9785502   5.8957342   2.5247046   1.1505297
[7]   0.2710316

$vectors:
[,1]        [,2]       [,3]        [,4]        [,5]        [,6]
[1,] 0.42055360 -0.12038402 -0.3171150 -0.47136100  0.59883830  0.05743392
[2,] 0.58891939 -0.06544701  0.5939649  0.07524351 -0.05949667  0.53028957
[3,] 0.25070446 -0.25606621 -0.4609316  0.16111410 -0.62885652  0.17742124
[4,] 0.12864131 -0.73754639 -0.1590590  0.41980231  0.25205680 -0.14283133
[5,] 0.14144240 -0.36483937  0.3609846 -0.52252896 -0.34535826 -0.56931521
[6,] 0.07253396  0.05320196 -0.3870509 -0.47051219 -0.24055838  0.30161519
[7,] 0.60962298  0.48553226 -0.1696108  0.27485861 -0.04146119 -0.49852152
[,7]
[1,]  0.35212680
[2,] -0.07541783
[3,]  0.45411767
[4,] -0.39244052
[5,]  0.01215659
[6,] -0.68693393
[7,] -0.19509084```
Store the eigenvectors in P and check that P %*% t(P) = I, i.e., that the eigenvectors are orthonormal.
```> P.S <- e.S$vectors
> P.S %*% t(P.S)
[,1]          [,2]          [,3]          [,4]          [,5]
[1,]  1.000000e+00  3.989864e-16  8.326673e-17  3.885781e-16 -5.117434e-17
[2,]  3.989864e-16  1.000000e+00 -6.938894e-18  1.491862e-16  1.315137e-16
[3,]  8.326673e-17 -6.938894e-18  1.000000e+00  3.885781e-16 -1.830133e-16
[4,]  3.885781e-16  1.491862e-16  3.885781e-16  1.000000e+00  3.452100e-16
[5,] -5.117434e-17  1.315137e-16 -1.830133e-16  3.452100e-16  1.000000e+00
[6,] -9.159340e-16  3.053113e-16  0.000000e+00 -5.551115e-17 -3.035766e-16
[7,]  1.804112e-16 -1.006140e-16  2.775558e-17 -3.608225e-16 -1.517883e-17
[,6]          [,7]
[1,] -9.159340e-16  1.804112e-16
[2,]  3.053113e-16 -1.006140e-16
[3,]  0.000000e+00  2.775558e-17
[4,] -5.551115e-17 -3.608225e-16
[5,] -3.035766e-16 -1.517883e-17
[6,]  1.000000e+00 -5.551115e-17
[7,] -5.551115e-17  1.000000e+00
> round(.Last.value,6)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    0    0    0    0    0    0
[2,]    0    1    0    0    0    0    0
[3,]    0    0    1    0    0    0    0
[4,]    0    0    0    1    0    0    0
[5,]    0    0    0    0    1    0    0
[6,]    0    0    0    0    0    1    0
[7,]    0    0    0    0    0    0    1```
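The same orthonormality check can be written more compactly with crossprod. A sketch on a synthetic symmetric matrix (S itself would work identically):

```r
# Sketch: check eigenvector orthonormality, t(P) %*% P = I, via crossprod,
# using a random symmetric positive-definite matrix in place of S.
set.seed(2)
A <- crossprod(matrix(rnorm(49), 7, 7))   # 7 x 7 symmetric matrix
P <- eigen(A, symmetric = TRUE)$vectors
max(abs(crossprod(P) - diag(7)))          # of the order of machine epsilon
```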
Put the eigenvalues in a diagonal matrix (called L here).
```> L.S <- diag(e.S$values)
> L.S
[,1]     [,2]    [,3]     [,4]     [,5]    [,6]      [,7]
[1,] 285.1366  0.00000 0.00000 0.000000 0.000000 0.00000 0.0000000
[2,]   0.0000 17.26784 0.00000 0.000000 0.000000 0.00000 0.0000000
[3,]   0.0000  0.00000 8.97855 0.000000 0.000000 0.00000 0.0000000
[4,]   0.0000  0.00000 0.00000 5.895734 0.000000 0.00000 0.0000000
[5,]   0.0000  0.00000 0.00000 0.000000 2.524705 0.00000 0.0000000
[6,]   0.0000  0.00000 0.00000 0.000000 0.000000 1.15053 0.0000000
[7,]   0.0000  0.00000 0.00000 0.000000 0.000000 0.00000 0.2710316```
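A quick sanity check on L: the eigenvalues must sum to the trace of S, the total variance. A sketch on a synthetic symmetric matrix:

```r
# Sketch: the sum of the eigenvalues equals the trace of the matrix.
set.seed(6)
A <- crossprod(matrix(rnorm(36), 6, 6))   # random symmetric 6 x 6 matrix
e <- eigen(A, symmetric = TRUE)
sum(e$values) - sum(diag(A))              # zero up to rounding error
```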
Check the identity S = P %*% L %*% t(P). The difference printed below is zero up to rounding error.
```> P.S%*% L.S %*% t(P.S) - S
[,1]          [,2]          [,3]          [,4]          [,5]
[1,] -2.131628e-14 -2.842171e-14 -1.065814e-14  1.421085e-14 -3.552714e-15
[2,] -2.842171e-14 -9.947598e-14 -3.552714e-14  3.552714e-15 -1.421085e-14
[3,] -7.105427e-15 -2.842171e-14 -7.105427e-15  1.243450e-14 -5.329071e-15
[4,]  1.776357e-14  7.105427e-15  1.243450e-14  1.243450e-14  8.881784e-15
[5,] -3.552714e-15 -1.421085e-14 -3.552714e-15  8.881784e-15 -7.105427e-15
[6,] -8.526513e-14 -7.815970e-14 -4.263256e-14 -9.992007e-15 -2.753353e-14
[7,]  2.842171e-14 -2.842171e-14  0.000000e+00  3.552714e-15 -3.552714e-15
[,6]          [,7]
[1,] -8.348877e-14  1.421085e-14
[2,] -7.815970e-14 -4.263256e-14
[3,] -4.263256e-14  0.000000e+00
[4,] -9.992007e-15  3.552714e-15
[5,] -2.664535e-14  0.000000e+00
[6,] -5.240253e-14 -3.907985e-14
[7,] -3.730349e-14  5.684342e-14```
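A more compact numerical version of this check is to look only at the largest absolute entry of the difference. A sketch, again on a synthetic symmetric matrix:

```r
# Sketch: confirm the spectral decomposition A = P L t(P) numerically.
set.seed(3)
A <- crossprod(matrix(rnorm(25), 5, 5))   # random symmetric 5 x 5 matrix
e <- eigen(A, symmetric = TRUE)
recon <- e$vectors %*% diag(e$values) %*% t(e$vectors)
max(abs(recon - A))                       # tiny: the identity holds
```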
Watch what happens if you don't use symmetric=T. Here eigen is applied to the correlation matrix R, and the eigenvectors it returns are not orthonormal.
```> e.R <- eigen(R)
> e.R
$values:
[1] 5.03459779 0.93351614 0.49791975 0.42124549 0.08104043 0.02034063
[7] 0.01133977

$vectors:
          [,1]         [,2]         [,3]        [,4]         [,5]        [,6]
[1,] 0.7269274 -0.116207095 -0.056760689  0.03974777 -0.687192227 -0.28993184
[2,] 0.7043683  0.030454407 -0.332704938 -0.01008627  0.000128432  0.67646542
[3,] 0.7057720  0.009568613  0.153532259  0.30479417  0.761650850 -0.13507161
[4,] 0.4932870  0.695047750  0.339482201  0.28395450 -0.283579937  0.09834274
[5,] 0.5851533  0.306695959  0.004452648 -0.79414371  0.189333749 -0.16961040
[6,] 0.4847059 -0.667972463  0.453988739 -0.14415168 -0.094478767  0.20350664
[7,] 0.6828969 -0.208350975 -0.326359275  0.23076897  0.053870538 -0.31966116
            [,7]
[1,] -0.45877388
[2,] -0.08646865
[3,] -0.34694455
[4,]  0.26071817
[5,]  0.06285144
[6,]  0.19855797
[7,]  0.55299119
> P.R <- e.R$vectors
> P.R %*% t(P.R)
          [,1]      [,2]      [,3]       [,4]      [,5]       [,6]      [,7]
[1,] 1.3134963 0.3704216 0.1902634  0.3165822 0.2481377  0.3132995 0.3502879
[2,] 0.3704216 1.0729387 0.3819860  0.2967566 0.3078862  0.2919624 0.3168702
[3,] 0.1902634 0.3819860 1.3334041  0.1737410 0.3198625  0.1931292 0.3925566
[4,] 0.3165822 0.2967566 0.1737410  1.0803646 0.2238431 -0.0134117 0.2442469
[5,] 0.2481377 0.3078862 0.3198625  0.2238431 1.1357161  0.1553362 0.2501557
[6,] 0.3132995 0.2919624 0.1931292 -0.0134117 0.1553362  0.9977790 0.3284057
[7,] 0.3502879 0.3168702 0.3925566  0.2442469 0.2501557  0.3284057 1.0804075```
This should have been the identity but isn't. Now make a plot of the first principal component versus the second.
```> Y.S <- sales %*% P.S
> postscript("prcmps_S.ps",horizontal=F)
> plot(Y.S[,1],Y.S[,2],xlab="First PC",ylab="Second PC",main="Plot of 1st vs 2nd PC\n using S")
> dev.off()
Generated postscript file "prcmps_S.ps".```
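One caveat: the scores above are computed from the raw data matrix, so sales %*% P.S is not centred. Centring shifts each score by a constant but leaves the shape of the plot, and all the variances, unchanged, as this sketch with a synthetic matrix illustrates:

```r
# Sketch: centred vs uncentred principal component scores.
set.seed(4)
X <- matrix(rnorm(60), ncol = 3)
P <- eigen(var(X), symmetric = TRUE)$vectors
Y  <- X %*% P                                        # uncentred, as in the text
Yc <- scale(X, center = TRUE, scale = FALSE) %*% P   # centred scores
max(abs(var(Y) - var(Yc)))                           # same covariance structure
```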
Now try to interpret the eigenvectors.
```> P.S
[,1]        [,2]       [,3]        [,4]        [,5]        [,6]
[1,] 0.42055360 -0.12038402 -0.3171150 -0.47136100  0.59883830  0.05743392
[2,] 0.58891939 -0.06544701  0.5939649  0.07524351 -0.05949667  0.53028957
[3,] 0.25070446 -0.25606621 -0.4609316  0.16111410 -0.62885652  0.17742124
[4,] 0.12864131 -0.73754639 -0.1590590  0.41980231  0.25205680 -0.14283133
[5,] 0.14144240 -0.36483937  0.3609846 -0.52252896 -0.34535826 -0.56931521
[6,] 0.07253396  0.05320196 -0.3870509 -0.47051219 -0.24055838  0.30161519
[7,] 0.60962298  0.48553226 -0.1696108  0.27485861 -0.04146119 -0.49852152
[,7]
[1,]  0.35212680
[2,] -0.07541783
[3,]  0.45411767
[4,] -0.39244052
[5,]  0.01215659
[6,] -0.68693393
[7,] -0.19509084```
The leading principal component uses the first column of P as coefficients. It is hard to interpret, although all the weights are positive. The largest weights fall on the variables with the largest variances, which is natural but uninformative. The trouble is that the 7 variables are not really measured on comparable scales. Now we do the whole thing again with R.
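Before moving on, it is worth quantifying how dominant the first component is. Using the eigenvalues of S printed earlier, the cumulative proportions of total variance are:

```r
# Cumulative proportion of variance explained, from the eigenvalues of S above.
vals <- c(285.1366314, 17.2678388, 8.9785502, 5.8957342, 2.5247046,
          1.1505297, 0.2710316)
round(cumsum(vals) / sum(vals), 3)
# The first PC alone accounts for about 89% of the total variance.
```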
```> e.R <- eigen(R,symmetric=T)
> e.R
$values:
[1] 5.03459779 0.93351614 0.49791975 0.42124549 0.08104043 0.02034063
0.01133977

$vectors:
[,1]         [,2]         [,3]        [,4]          [,5] [,6]
[1,] 0.4336719  0.111754422 -0.075488541  0.04237344 -0.6324942624 -0.3365963
[2,] 0.4202136 -0.029287495 -0.442478953 -0.01075255  0.0001182093 0.7853424
[3,] 0.4210510 -0.009201975  0.204189315  0.32492838  0.7010262539 -0.1568114
[4,] 0.2942863 -0.668415809  0.451492333  0.30271208 -0.2610080204 0.1141710
[5,] 0.3490920 -0.294944379  0.005921773 -0.84660356  0.1742634819 -0.1969091
[6,] 0.2891669  0.642377957  0.603779622 -0.15367411 -0.0869586057 0.2362610
[7,] 0.4074041  0.200367651 -0.434039576  0.24601320  0.0495826418 -0.3711105
[,7]
[1,]  0.52782527
[2,]  0.09948330
[3,]  0.39916419
[4,] -0.29995962
[5,] -0.07231139
[6,] -0.22844351
[7,] -0.63622351

> P.R <- e.R$vectors
> L.R <- diag(e.R$values)
> L.R
[,1]      [,2]      [,3]      [,4]       [,5]       [,6] [,7]
[1,] 5.034598 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000
[2,] 0.000000 0.9335161 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000
[3,] 0.000000 0.0000000 0.4979197 0.0000000 0.00000000 0.00000000 0.00000000
[4,] 0.000000 0.0000000 0.0000000 0.4212455 0.00000000 0.00000000 0.00000000
[5,] 0.000000 0.0000000 0.0000000 0.0000000 0.08104043 0.00000000 0.00000000
[6,] 0.000000 0.0000000 0.0000000 0.0000000 0.00000000 0.02034063 0.00000000
[7,] 0.000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.01133977```
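Since the trace of a correlation matrix equals the number of variables, the eigenvalues of R can be turned into variance proportions by dividing by their total, 7:

```r
# Proportion of standardized variance explained by each PC of R.
vals <- c(5.03459779, 0.93351614, 0.49791975, 0.42124549, 0.08104043,
          0.02034063, 0.01133977)
round(vals / sum(vals), 3)        # first PC: about 72% of the total
round(cumsum(vals) / sum(vals), 3)
```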
Now let's try to interpret the eigenvectors for the correlation matrix.
```> P.R
[,1]         [,2]         [,3]        [,4]          [,5]    [,6]
[1,] 0.4336719  0.111754422 -0.075488541  0.04237344 -0.6324942624 -0.3365963
[2,] 0.4202136 -0.029287495 -0.442478953 -0.01075255  0.0001182093  0.7853424
[3,] 0.4210510 -0.009201975  0.204189315  0.32492838  0.7010262539 -0.1568114
[4,] 0.2942863 -0.668415809  0.451492333  0.30271208 -0.2610080204  0.1141710
[5,] 0.3490920 -0.294944379  0.005921773 -0.84660356  0.1742634819 -0.1969091
[6,] 0.2891669  0.642377957  0.603779622 -0.15367411 -0.0869586057  0.2362610
[7,] 0.4074041  0.200367651 -0.434039576  0.24601320  0.0495826418 -0.3711105
[,7]
[1,]  0.52782527
[2,]  0.09948330
[3,]  0.39916419
[4,] -0.29995962
[5,] -0.07231139
[6,] -0.22844351
[7,] -0.63622351```
The entries in the first column are nearly constant, so the leading principal component is close to being a simple average of the 7 variables, each expressed in standard units (that is, centred at its mean and divided by its standard deviation). The next principal component is essentially a weighted average of Mathematics and Abstract Reasoning minus a similar average of Mechanical Reasoning and Creativity. The third PC seems hard to interpret.

Now transform the data to get the principal component scores. Then plot the first two principal components against each other and examine the plot, looking for interesting cases.
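The transcript below never shows Y.R being created; presumably it was formed by standardizing each column of the data and multiplying by P.R. A sketch of that step, with a synthetic matrix standing in for the sales data:

```r
# Sketch: principal component scores from the correlation matrix.
# Each column is standardized (centred, divided by its SD), then rotated.
set.seed(5)
X <- matrix(rnorm(70), ncol = 7)
P.R <- eigen(cor(X), symmetric = TRUE)$vectors
Y.R <- scale(X) %*% P.R
# The score variances equal the eigenvalues of cor(X):
round(diag(var(Y.R)) - eigen(cor(X), symmetric = TRUE)$values, 6)
```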

```> postscript("prcmps_R.ps", horizontal = F)
> plot(Y.R[, 1], Y.R[, 2], xlab = "First PC", ylab = "Second PC", main =
+         "Plot of 1st vs 2nd PC\n using R")
> l1 <- (abs(Y.R[,1])>25)
> l1
 [1] F F F F F F F T F F F F F F F T F F F F F F F F F F F F F F F F F F F F F F
[39] F F F F F T F F F T F F
> l2 <- (abs(Y.R[,2]) > 6)
> l2
 [1] F F F F F F F F F F F F F F F F F F F F F T F F F F F F F F F T F F F T T F
[39] F F F T F F F F F F F F
> l3 <- (l1|l2)
> l3
 [1] F F F F F F F T F F F F F F F T F F F F F T F F F F F F F F F T F F F T T F
[39] F F F T F T F F F T F F
> text(Y.R[,1],Y.R[,2],labels=l3)
> dev.off()
Generated postscript file "prcmps_R.ps".
null device
1```
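To list the flagged salespeople by case number rather than reading them off the logical vector, which() does the conversion. A sketch, with the flags re-entered by hand from the l3 printout above since Y.R is not reproduced here:

```r
# Sketch: which() converts a logical flag vector to case numbers.
l3 <- rep(FALSE, 50)
l3[c(8, 16, 22, 32, 36, 37, 42, 44, 48)] <- TRUE  # flags read off the printout of l3
which(l3)                                         # the interesting cases
```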
Here is the plot.