Example 21.1: Analysis of Iris Data Using PROC CANDISC

The CANDISC Procedure

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica.

This example performs a canonical discriminant analysis, creates an output data set containing scores on the canonical variables, and plots the canonical variables. The following statements produce Output 21.1.1 through Output 21.1.7:

   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;

   data iris;
      title 'Fisher (1936) Iris Data';
      input SepalLength SepalWidth PetalLength PetalWidth 
            Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(Species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
   ;
   proc candisc data=iris out=outcan distance anova;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
   run;

PROC CANDISC first displays information about the observations and the classes in the data set in Output 21.1.1.

Output 21.1.1: Iris Data: Summary Information

Fisher (1936) Iris Data

The CANDISC Procedure

Observations	150	DF Total	149
Variables	4	DF Within Classes	147
Classes	3	DF Between Classes	2

Class Level Information
Species	Variable Name	Frequency	Weight	Proportion
Setosa	Setosa	50	50.0000	0.333333
Versicolor	Versicolor	50	50.0000	0.333333
Virginica	Virginica	50	50.0000	0.333333

The DISTANCE option in the PROC CANDISC statement displays squared Mahalanobis distances between class means. Results from the DISTANCE option are shown in Output 21.1.2 and Output 21.1.3.

Output 21.1.2: Iris Data: Squared Mahalanobis Distances

Fisher (1936) Iris Data

The CANDISC Procedure

Squared Distance to Species
From Species	Setosa	Versicolor	Virginica
Setosa	0	89.86419	179.38471
Versicolor	89.86419	0	17.20107
Virginica	179.38471	17.20107	0

Output 21.1.3: Iris Data: Squared Mahalanobis Distance Statistics

Fisher (1936) Iris Data

The CANDISC Procedure

F Statistics, NDF=4, DDF=144 for Squared Distance to Species
From Species	Setosa	Versicolor	Virginica
Setosa	0	550.18889	1098
Versicolor	550.18889	0	105.31265
Virginica	1098	105.31265	0

Prob > Mahalanobis Distance for Squared Distance to Species
From Species	Setosa	Versicolor	Virginica
Setosa	1.0000	<.0001	<.0001
Versicolor	<.0001	1.0000	<.0001
Virginica	<.0001	<.0001	1.0000
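The F statistics in Output 21.1.3 can be related to the squared distances in Output 21.1.2 by simple arithmetic. The following statements are a minimal sketch of that relationship. The formula is an assumption (the usual two-group form with a covariance matrix pooled over all three classes) and is not quoted from the procedure's documentation, and the data set name DistCheck is hypothetical. The class sizes and degrees of freedom are taken from Output 21.1.1 and Output 21.1.3.

```sas
/* Sketch: recompute pairwise F statistics from squared Mahalanobis distances. */
/* Assumed formula: F = (n1*n2/(n1+n2)) * ((N-g-p+1)/((N-g)*p)) * D2           */
data DistCheck;                 /* hypothetical data set name */
   length Pair $ 24;
   n1 = 50; n2 = 50;            /* class frequencies (Output 21.1.1)  */
   N  = 150;                    /* total number of observations       */
   g  = 3;                      /* number of classes                  */
   p  = 4;                      /* number of variables                */
   ndf = p;                     /* numerator DF   = 4                 */
   ddf = N - g - p + 1;         /* denominator DF = 144               */
   input Pair $ D2;             /* D2 = squared distance (Output 21.1.2) */
   F     = (n1*n2/(n1+n2)) * (ddf/((N-g)*p)) * D2;
   ProbF = 1 - probf(F, ndf, ddf);   /* upper-tail probability */
   datalines;
Setosa-Versicolor     89.86419
Setosa-Virginica     179.38471
Versicolor-Virginica  17.20107
;

proc print data=DistCheck noobs;
   var Pair D2 F ProbF;
   format F 10.3 ProbF pvalue6.4;
run;
```

If the assumed formula applies, the computed F values should agree with those in Output 21.1.3 up to rounding of the displayed distances.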

The ANOVA option requests univariate tests of the hypothesis that the class means are equal. The resulting R² values (see Output 21.1.4) range from 0.4008 for SepalWidth to 0.9414 for PetalLength, and each variable is significant at the 0.0001 level. The multivariate test for differences between the classes (which is displayed by default) is also significant at the 0.0001 level; you would expect this from the highly significant univariate test results.

Output 21.1.4: Iris Data: Univariate and Multivariate Statistics

Fisher (1936) Iris Data

The CANDISC Procedure

Univariate Test Statistics
F Statistics, Num DF=2, Den DF=147
Variable	Label	Total Standard Deviation	Pooled Standard Deviation	Between Standard Deviation	R-Square	R-Square / (1-RSq)	F Value	Pr > F
SepalLength	Sepal Length in mm.	8.2807	5.1479	7.9506	0.6187	1.6226	119.26	<.0001
SepalWidth	Sepal Width in mm.	4.3587	3.3969	3.3682	0.4008	0.6688	49.16	<.0001
PetalLength	Petal Length in mm.	17.6530	4.3033	20.9070	0.9414	16.0566	1180.16	<.0001
PetalWidth	Petal Width in mm.	7.6224	2.0465	8.9673	0.9289	13.0613	960.01	<.0001

Average R-Square
Unweighted	0.7224358
Weighted by Variance	0.8689444

Multivariate Statistics and F Approximations
S=2 M=0.5 N=71
Statistic	Value	F Value	Num DF	Den DF	Pr > F
Wilks' Lambda	0.02343863	199.15	8	288	<.0001
Pillai's Trace	1.19189883	53.47	8	290	<.0001
Hotelling-Lawley Trace	32.47732024	582.20	8	203.4	<.0001
Roy's Greatest Root	32.19192920	1166.96	4	145	<.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.

NOTE: F Statistic for Wilks' Lambda is exact.
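The univariate F values in Output 21.1.4 follow from the column R-Square / (1-RSq) and the degrees of freedom shown in the table header. The following statements sketch that calculation by using the displayed (rounded) ratios, so the results can differ from the output in the last digit. The data set name UniCheck is hypothetical.

```sas
/* Sketch: recover R-square and the univariate F value from R-Square/(1-RSq). */
data UniCheck;                  /* hypothetical data set name */
   length Variable $ 12;
   NumDF = 2;                   /* classes - 1            (Output 21.1.4) */
   DenDF = 147;                 /* observations - classes (Output 21.1.4) */
   input Variable $ Ratio;      /* Ratio = R-Square / (1-RSq)             */
   RSquare = Ratio / (1 + Ratio);
   F       = Ratio * DenDF / NumDF;
   datalines;
SepalLength  1.6226
SepalWidth   0.6688
PetalLength 16.0566
PetalWidth  13.0613
;

proc print data=UniCheck noobs;
   var Variable Ratio RSquare F;
   format RSquare 6.4 F 8.2;
run;
```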

As Output 21.1.5 shows, the R² between Can1 and the class variable, 0.969872, is much larger than the corresponding R² for Can2, 0.222027.

Output 21.1.5: Iris Data: Canonical Correlations and Eigenvalues

Fisher (1936) Iris Data

The CANDISC Procedure

					Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)				Test of H0: The canonical correlations in the current row and all that follow are zero
	Canonical Correlation	Adjusted Canonical Correlation	Approximate Standard Error	Squared Canonical Correlation	Eigenvalue	Difference	Proportion	Cumulative	Likelihood Ratio	Approximate F Value	Num DF	Den DF	Pr > F
1	0.984821	0.984508	0.002468	0.969872	32.1919	31.9065	0.9912	0.9912	0.02343863	199.15	8	288	<.0001
2	0.471197	0.461445	0.063734	0.222027	0.2854		0.0088	1.0000	0.77797337	13.79	3	145	<.0001
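The eigenvalues in Output 21.1.5 and the multivariate statistics in Output 21.1.4 are all functions of the two squared canonical correlations. The following statements are a small sketch of those relationships that uses the displayed values 0.969872 and 0.222027. The data set name CanCheck and the variable names are hypothetical, and the results match the output only up to rounding of the inputs.

```sas
/* Sketch: derive eigenvalues and multivariate statistics from CanRsq. */
data CanCheck;                  /* hypothetical data set name */
   CanRsq1 = 0.969872;          /* squared canonical correlation, Can1 */
   CanRsq2 = 0.222027;          /* squared canonical correlation, Can2 */

   /* Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq) */
   Eigen1 = CanRsq1 / (1 - CanRsq1);
   Eigen2 = CanRsq2 / (1 - CanRsq2);
   Proportion1 = Eigen1 / (Eigen1 + Eigen2);

   /* Multivariate statistics expressed through the same quantities */
   Wilks           = (1 - CanRsq1) * (1 - CanRsq2);
   Pillai          = CanRsq1 + CanRsq2;
   HotellingLawley = Eigen1 + Eigen2;
   Roy             = Eigen1;    /* largest eigenvalue */
run;

proc print data=CanCheck noobs;
run;
```

The likelihood ratio in the first row of Output 21.1.5 is the same quantity as Wilks' lambda, and the likelihood ratio in the second row is 1 - CanRsq2.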

Output 21.1.6: Iris Data: Correlations Between Canonical and Original Variables

Fisher (1936) Iris Data

The CANDISC Procedure

Total Canonical Structure
Variable	Label	Can1	Can2
SepalLength	Sepal Length in mm.	0.791888	0.217593
SepalWidth	Sepal Width in mm.	-0.530759	0.757989
PetalLength	Petal Length in mm.	0.984951	0.046037
PetalWidth	Petal Width in mm.	0.972812	0.222902

Between Canonical Structure
Variable	Label	Can1	Can2
SepalLength	Sepal Length in mm.	0.991468	0.130348
SepalWidth	Sepal Width in mm.	-0.825658	0.564171
PetalLength	Petal Length in mm.	0.999750	0.022358
PetalWidth	Petal Width in mm.	0.994044	0.108977

Pooled Within Canonical Structure
Variable	Label	Can1	Can2
SepalLength	Sepal Length in mm.	0.222596	0.310812
SepalWidth	Sepal Width in mm.	-0.119012	0.863681
PetalLength	Petal Length in mm.	0.706065	0.167701
PetalWidth	Petal Width in mm.	0.633178	0.737242

The raw canonical coefficients (shown in Output 21.1.7) for the first canonical variable, Can1, show that the classes differ most widely on the linear combination of the centered variables -0.0829378 × SepalLength - 0.153447 × SepalWidth + 0.220121 × PetalLength + 0.281046 × PetalWidth.
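To see how this linear combination relates to the scores stored in the OUT= data set, you can apply the raw coefficients by hand. The following statements are a sketch that assumes the coefficients apply to variables centered at their total-sample means. The data set names Centered and ScoreCheck and the variables Can1Manual and AbsDiff are hypothetical.

```sas
/* Center the four variables at their total-sample means (assumption). */
proc standard data=outcan mean=0 out=Centered;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Apply the raw coefficients for Can1 and compare with the stored score. */
data ScoreCheck;
   set Centered;
   Can1Manual = -0.0829377642*SepalLength - 0.1534473068*SepalWidth
              +  0.2201211656*PetalLength + 0.2810460309*PetalWidth;
   AbsDiff = abs(Can1 - Can1Manual);
run;

/* The maximum absolute difference summarizes the agreement. */
proc means data=ScoreCheck n max;
   var AbsDiff;
run;
```

If the centering assumption holds, the maximum absolute difference should be close to zero, limited only by the number of digits displayed for the coefficients.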

Output 21.1.7: Iris Data: Canonical Coefficients

Fisher (1936) Iris Data

The CANDISC Procedure

Total-Sample Standardized Canonical Coefficients
Variable	Label	Can1	Can2
SepalLength	Sepal Length in mm.	-0.686779533	0.019958173
SepalWidth	Sepal Width in mm.	-0.668825075	0.943441829
PetalLength	Petal Length in mm.	3.885795047	-1.645118866
PetalWidth	Petal Width in mm.	2.142238715	2.164135931

Pooled Within-Class Standardized Canonical Coefficients
Variable	Label	Can1	Can2
SepalLength	Sepal Length in mm.	-.4269548486	0.0124075316
SepalWidth	Sepal Width in mm.	-.5212416758	0.7352613085
PetalLength	Petal Length in mm.	0.9472572487	-.4010378190
PetalWidth	Petal Width in mm.	0.5751607719	0.5810398645

Raw Canonical Coefficients
Variable	Label	Can1	Can2
SepalLength	Sepal Length in mm.	-.0829377642	0.0024102149
SepalWidth	Sepal Width in mm.	-.1534473068	0.2164521235
PetalLength	Petal Length in mm.	0.2201211656	-.0931921210
PetalWidth	Petal Width in mm.	0.2810460309	0.2839187853

Class Means on Canonical Variables
Species	Can1	Can2
Setosa	-7.607599927	0.215133017
Versicolor	1.825049490	-0.727899622
Virginica	5.782550437	0.512766605
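The class means on the canonical variables can be reproduced from the OUT= data set, and the squared Euclidean distances between them can be compared with the squared Mahalanobis distances in Output 21.1.2. The following statements are a sketch of that check. The data set name CanMeans and the column aliases are hypothetical.

```sas
/* Class means of the canonical variable scores. */
proc means data=outcan noprint nway;
   class Species;
   var Can1 Can2;
   output out=CanMeans mean=;   /* keeps the names Can1 and Can2 */
run;

/* Squared Euclidean distance between each pair of class means. */
proc sql;
   select a.Species as FromSpecies,
          b.Species as ToSpecies,
          (a.Can1 - b.Can1)**2 + (a.Can2 - b.Can2)**2 as SquaredDistance
   from CanMeans as a, CanMeans as b
   where a.Species < b.Species;
quit;
```

With the class means displayed above, the value for Setosa and Versicolor works out to about 89.86 and the value for Versicolor and Virginica to about 17.20, in line with Output 21.1.2. This agreement is expected here because both canonical variables are retained.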

The plot of canonical variables in Output 21.1.8 shows that, of the two canonical variables, Can1 has the greater discriminatory power. The following invocation of the %PLOTIT macro creates this plot:

   %plotit(data=outcan, plotvars=Can2 Can1,
           labelvar=_blank_, symvar=symbol, typevar=symbol,
           symsize=1, symlen=4, exttypes=symbol, ls=100,
           tsize=1.5, extend=close);

Output 21.1.8: Iris Data: Plot of First Two Canonical Variables
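If the %PLOTIT macro is not available, a similar scatter plot can be drawn with PROC SGPLOT. The following statements are an alternative sketch, not the method used to produce Output 21.1.8, and they do not reproduce the macro's symbol placement or axis scaling.

```sas
/* Alternative sketch: scatter plot of the canonical scores by species. */
proc sgplot data=outcan;
   scatter x=Can1 y=Can2 / group=Species;
   xaxis label='Can1';
   yaxis label='Can2';
run;
```

As in the %PLOTIT call, Can2 is on the vertical axis and Can1 is on the horizontal axis, so separation along the horizontal direction reflects the discriminatory power of Can1.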
