Example 21.1: Analysis of Iris Data Using PROC CANDISC
The iris data published by Fisher (1936) have been widely used
for examples in discriminant analysis and cluster analysis.
The sepal length, sepal width, petal length, and petal width
are measured in millimeters on fifty iris specimens
from each of three species: Iris setosa,
I. versicolor, and I. virginica.
This example is a canonical discriminant analysis
that creates an output data set containing scores on the
canonical variables and plots the canonical variables.
The following statements produce Output 21.1.1 through
Output 21.1.7:
proc format;
   value specname
      1='Setosa '
      2='Versicolor'
      3='Virginica ';
run;

data iris;
   title 'Fisher (1936) Iris Data';
   input SepalLength SepalWidth PetalLength PetalWidth
         Species @@;
   format Species specname.;
   label SepalLength='Sepal Length in mm.'
         SepalWidth ='Sepal Width in mm.'
         PetalLength='Petal Length in mm.'
         PetalWidth ='Petal Width in mm.';
   symbol = put(Species, specname10.);
   datalines;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;

proc candisc data=iris out=outcan distance anova;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
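Before reading the printed output, it can help to confirm what the OUT= data set holds. The following is a minimal sketch, assuming the preceding steps ran without error so that outcan contains the original variables plus the canonical variables Can1 and Can2; the OBS=10 limit is an arbitrary choice made for this illustration.

```sas
/* List the first few scored observations (OBS=10 is an arbitrary limit) */
proc print data=outcan(obs=10);
   var Species Can1 Can2;
run;

/* Per-species means and standard deviations of the canonical variables */
proc means data=outcan mean std;
   class Species;
   var Can1 Can2;
run;
```

The means reported by PROC MEANS should agree with the Class Means on Canonical Variables table in Output 21.1.7.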
PROC CANDISC first displays information about the observations
and the classes in the data set in Output 21.1.1.
Output 21.1.1: Iris Data: Summary Information
Observations    150    DF Total              149
Variables         4    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information
Species       Variable Name    Frequency     Weight    Proportion
Setosa        Setosa                  50    50.0000      0.333333
Versicolor    Versicolor              50    50.0000      0.333333
Virginica     Virginica               50    50.0000      0.333333

The DISTANCE option in the PROC CANDISC statement
displays squared Mahalanobis distances between class means.
Results from the DISTANCE option are shown in
Output 21.1.2 and Output 21.1.3.
Output 21.1.2: Iris Data: Squared Mahalanobis Distances
Squared Distance to Species
From Species       Setosa    Versicolor     Virginica
Setosa                  0      89.86419     179.38471
Versicolor       89.86419             0      17.20107
Virginica       179.38471      17.20107             0

Output 21.1.3: Iris Data: Squared Mahalanobis Distance Statistics
F Statistics, NDF=4, DDF=144 for Squared Distance to Species
From Species       Setosa    Versicolor     Virginica
Setosa                  0     550.18889          1098
Versicolor      550.18889             0     105.31265
Virginica            1098     105.31265             0

Prob > Mahalanobis Distance for Squared Distance to Species
From Species       Setosa    Versicolor     Virginica
Setosa             1.0000        <.0001        <.0001
Versicolor         <.0001        1.0000        <.0001
Virginica          <.0001        <.0001        1.0000

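The F statistics in Output 21.1.3 can be related to the squared distances in Output 21.1.2 by hand. The sketch below uses the standard two-group conversion with a pooled covariance matrix, F = ((n-g-p+1) / (p(n-g))) × (n1 n2 / (n1+n2)) × D², which is an assumption about the computation rather than something stated in this example; it does, however, give the displayed degrees of freedom (4 and 144) and reproduces the displayed values to rounding. The variable names are chosen for illustration only.

```sas
data _null_;
   n  = 150;           /* total observations                       */
   g  = 3;             /* number of classes                        */
   p  = 4;             /* number of variables                      */
   n1 = 50; n2 = 50;   /* sizes of the two classes being compared  */
   d2 = 89.86419;      /* Setosa to Versicolor, from Output 21.1.2 */

   ndf = p;
   ddf = n - g - p + 1;
   f   = (ddf / (p * (n - g))) * (n1 * n2 / (n1 + n2)) * d2;
   prob = 1 - probf(f, ndf, ddf);   /* upper-tail probability */

   put ndf= ddf= f= prob=;
run;
```

For the Setosa and Versicolor pair this gives an F value of approximately 550.19. Substituting 179.38471 or 17.20107 for d2 gives approximately 1098 and 105.31, respectively.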
The ANOVA option specifies testing of the hypothesis that
the class means are equal using univariate statistics.
The resulting R-square values (see Output 21.1.4) range from 0.4008
for SepalWidth to 0.9414 for PetalLength, and each
variable is significant at the 0.0001 level.
The multivariate test for differences between the classes
(which is displayed by default)
is also significant at the 0.0001 level; you would
expect this from the highly significant univariate test results.
Output 21.1.4: Iris Data: Univariate and Multivariate Statistics
Univariate Test Statistics
F Statistics, Num DF=2, Den DF=147

                                         Total     Pooled    Between              R-Square
                                      Standard   Standard   Standard                     /
Variable      Label                  Deviation  Deviation  Deviation   R-Square    (1-RSq)    F Value     Pr > F
SepalLength   Sepal Length in mm.       8.2807     5.1479     7.9506     0.6187     1.6226     119.26     <.0001
SepalWidth    Sepal Width in mm.        4.3587     3.3969     3.3682     0.4008     0.6688      49.16     <.0001
PetalLength   Petal Length in mm.      17.6530     4.3033    20.9070     0.9414    16.0566    1180.16     <.0001
PetalWidth    Petal Width in mm.        7.6224     2.0465     8.9673     0.9289    13.0613     960.01     <.0001

Average R-Square
Unweighted              0.7224358
Weighted by Variance    0.8689444

Multivariate Statistics and F Approximations
S=2  M=0.5  N=71
Statistic                       Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda              0.02343863     199.15         8       288    <.0001
Pillai's Trace             1.19189883      53.47         8       290    <.0001
Hotelling-Lawley Trace    32.47732024     582.20         8     203.4    <.0001
Roy's Greatest Root       32.19192920    1166.96         4       145    <.0001
NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

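The same hypotheses can be cross-checked with a one-way analysis of variance. The following is a sketch, not part of the original example, that assumes the iris data set created above and uses PROC GLM with a MANOVA statement; the univariate F tests and the four multivariate statistics it reports should agree with Output 21.1.4.

```sas
/* One-way ANOVA for each variable, plus the multivariate test
   of equal class mean vectors across Species */
proc glm data=iris;
   class Species;
   model SepalLength SepalWidth PetalLength PetalWidth = Species;
   manova h=Species;
run;
quit;
```

Each univariate F value also follows directly from the table: it is the R-Square / (1-RSq) column multiplied by Den DF / Num DF = 147/2. For SepalLength, 1.6226 × 73.5 is approximately 119.26.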
The R-square between Can1 and the class variable, 0.969872,
is much larger than the corresponding R-square for Can2,
0.222027. This is displayed in Output 21.1.5.
Output 21.1.5: Iris Data: Canonical Correlations and Eigenvalues
     Canonical      Adjusted Canonical    Approximate       Squared Canonical
     Correlation    Correlation           Standard Error    Correlation
1    0.984821       0.984508              0.002468          0.969872
2    0.471197       0.461445              0.063734          0.222027

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
     Eigenvalue    Difference    Proportion    Cumulative
1       32.1919       31.9065        0.9912        0.9912
2        0.2854                      0.0088        1.0000

Test of H0: The canonical correlations in the current row and all that follow are zero
     Likelihood Ratio    Approximate F Value    Num DF    Den DF    Pr > F
1          0.02343863                 199.15         8       288    <.0001
2          0.77797337                  13.79         3       145    <.0001

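The columns of Output 21.1.5 are tied together by simple arithmetic, as the table heading CanRsq/(1-CanRsq) indicates. The following sketch recomputes the eigenvalues, the proportion for the first canonical variable, and the likelihood ratio (Wilks' lambda) from the two squared canonical correlations; the inputs are the rounded values printed above, so the results match the table only to about four significant digits, and the variable names are illustrative.

```sas
data _null_;
   r1 = 0.969872;   /* squared canonical correlation for Can1 */
   r2 = 0.222027;   /* squared canonical correlation for Can2 */

   eig1  = r1 / (1 - r1);           /* eigenvalue of Inv(E)*H for Can1    */
   eig2  = r2 / (1 - r2);           /* eigenvalue of Inv(E)*H for Can2    */
   prop1 = eig1 / (eig1 + eig2);    /* proportion attributed to Can1      */
   wilks = (1 - r1) * (1 - r2);     /* likelihood ratio for the first row */

   put eig1= eig2= prop1= wilks=;
run;
```

The results are approximately 32.19, 0.2854, 0.9912, and 0.0234. The likelihood ratio in the second row, 0.77797337, is simply 1 - r2.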
Output 21.1.6: Iris Data: Correlations Between Canonical and
Original Variables
Total Canonical Structure
Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.791888     0.217593
SepalWidth     Sepal Width in mm.     -0.530759     0.757989
PetalLength    Petal Length in mm.     0.984951     0.046037
PetalWidth     Petal Width in mm.      0.972812     0.222902

Between Canonical Structure
Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.991468     0.130348
SepalWidth     Sepal Width in mm.     -0.825658     0.564171
PetalLength    Petal Length in mm.     0.999750     0.022358
PetalWidth     Petal Width in mm.      0.994044     0.108977

Pooled Within Canonical Structure
Variable       Label                       Can1         Can2
SepalLength    Sepal Length in mm.     0.222596     0.310812
SepalWidth     Sepal Width in mm.     -0.119012     0.863681
PetalLength    Petal Length in mm.     0.706065     0.167701
PetalWidth     Petal Width in mm.      0.633178     0.737242

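As the title of Output 21.1.6 says, the structure tables hold correlations between the canonical and the original variables. The total canonical structure uses ordinary total-sample correlations, so it can be checked from the scored data. The sketch below assumes the outcan data set created by the OUT= option earlier in this example.

```sas
/* Total-sample correlations between the original variables (rows)
   and the canonical variables (columns) */
proc corr data=outcan nosimple;
   var Can1 Can2;
   with SepalLength SepalWidth PetalLength PetalWidth;
run;
```

The correlations should match the Total Canonical Structure table, for example 0.984951 between PetalLength and Can1.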
The raw canonical coefficients (shown in Output 21.1.7)
for the first canonical variable,
Can1, show that the classes differ most widely on the
linear combination of the centered variables -0.0829378 × SepalLength - 0.153447 × SepalWidth + 0.220121 × PetalLength + 0.281046 × PetalWidth.
Output 21.1.7: Iris Data: Canonical Coefficients
Total-Sample Standardized Canonical Coefficients
Variable       Label                           Can1            Can2
SepalLength    Sepal Length in mm.     -0.686779533     0.019958173
SepalWidth     Sepal Width in mm.      -0.668825075     0.943441829
PetalLength    Petal Length in mm.      3.885795047    -1.645118866
PetalWidth     Petal Width in mm.       2.142238715     2.164135931

Pooled Within-Class Standardized Canonical Coefficients
Variable       Label                           Can1            Can2
SepalLength    Sepal Length in mm.     -.4269548486    0.0124075316
SepalWidth     Sepal Width in mm.      -.5212416758    0.7352613085
PetalLength    Petal Length in mm.     0.9472572487    -.4010378190
PetalWidth     Petal Width in mm.      0.5751607719    0.5810398645

Raw Canonical Coefficients
Variable       Label                           Can1            Can2
SepalLength    Sepal Length in mm.     -.0829377642    0.0024102149
SepalWidth     Sepal Width in mm.      -.1534473068    0.2164521235
PetalLength    Petal Length in mm.     0.2201211656    -.0931921210
PetalWidth     Petal Width in mm.      0.2810460309    0.2839187853

Class Means on Canonical Variables
Species               Can1            Can2
Setosa        -7.607599927     0.215133017
Versicolor     1.825049490    -0.727899622
Virginica      5.782550437     0.512766605

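To see how the raw coefficients produce the scores, Can1 can be rebuilt by hand and compared with the value stored by PROC CANDISC. The following is a sketch under two assumptions: that outcan still contains the original variables alongside Can1, and that "centered" means centered at the total-sample mean. The names can1check, Can1_check, and MaxAbsDiff are hypothetical names introduced for this illustration.

```sas
proc sql;
   /* Rebuild Can1 from the raw coefficients in Output 21.1.7.
      MEAN() without GROUP BY is remerged with each row, which
      centers every variable at its total-sample mean. */
   create table can1check as
   select Species, Can1,
          -0.0829377642 * (SepalLength - mean(SepalLength))
          -0.1534473068 * (SepalWidth  - mean(SepalWidth))
          +0.2201211656 * (PetalLength - mean(PetalLength))
          +0.2810460309 * (PetalWidth  - mean(PetalWidth)) as Can1_check
   from outcan;

   /* Largest discrepancy between the stored and rebuilt scores */
   select max(abs(Can1 - Can1_check)) as MaxAbsDiff
   from can1check;
quit;
```

If the assumptions hold, MaxAbsDiff should be close to zero, limited only by the ten printed digits of the coefficients. The SAS log notes that the first query requires remerging summary statistics, which is expected here.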
The plot of canonical variables in Output 21.1.8 shows that, of the two
canonical variables, Can1 has the greater discriminatory power. The
following invocation of the %PLOTIT macro creates this plot:
%plotit(data=outcan, plotvars=Can2 Can1,
labelvar=_blank_, symvar=symbol, typevar=symbol,
symsize=1, symlen=4, exttypes=symbol, ls=100,
tsize=1.5, extend=close);
Output 21.1.8: Iris Data: Plot of First Two Canonical Variables
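Where the %PLOTIT macro is not available, a comparable scatter plot can be drawn with PROC SGPLOT. This is an alternative sketch, not the method used to produce Output 21.1.8, and it assumes a SAS release that includes the ODS Statistical Graphics procedures. It omits the symbol and sizing options passed to %PLOTIT and puts Can1 on the horizontal axis.

```sas
/* Scatter plot of the first two canonical variables,
   with one marker style per species */
proc sgplot data=outcan;
   scatter x=Can1 y=Can2 / group=Species;
   xaxis label='Can1';
   yaxis label='Can2';
run;
```

In such a plot the three species should separate mainly along the Can1 axis, consistent with the class means in Output 21.1.7.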
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.