Example 25.1: Univariate Density Estimates and Posterior Probabilities
In this example, several discriminant analyses are run with
a single quantitative variable, petal width, so that density
estimates and posterior probabilities can be plotted easily.
The example produces Output 25.1.1
through Output 25.1.5.
The GCHART procedure is used to display the sample
distribution of petal width in the three species.
Note the overlap between species I. versicolor
and I. virginica that the bar chart shows.
These statements produce Output 25.1.1:
proc format;
   value specname
      1='Setosa '
      2='Versicolor'
      3='Virginica ';
run;
data iris;
   title 'Discriminant Analysis of Fisher (1936) Iris Data';
   input SepalLength SepalWidth PetalLength PetalWidth
         Species @@;
   format Species specname.;
   label SepalLength='Sepal Length in mm.'
         SepalWidth ='Sepal Width in mm.'
         PetalLength='Petal Length in mm.'
         PetalWidth ='Petal Width in mm.';
   symbol = put(Species, specname10.);
   datalines;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;
pattern1 c=red /*v=l1 */;
pattern2 c=yellow /*v=empty*/;
pattern3 c=blue /*v=r1 */;
axis1 label=(angle=90);
axis2 value=(height=.6);
legend1 frame label=none;
proc gchart data=iris;
   vbar PetalWidth / subgroup=Species midpoints=0 to 25
                     raxis=axis1 maxis=axis2 legend=legend1 cframe=ligr;
run;
Output 25.1.1: Sample Distribution of Petal Width in Three Species
In order to plot the density estimates and posterior
probabilities, a data set called plotdata is created
containing equally spaced values from -5 to 30, covering
the range of petal width with a little to spare on each end.
The plotdata data set is used with
the TESTDATA= option in PROC DISCRIM.
data plotdata;
   do PetalWidth=-5 to 30 by .5;
      output;
   end;
run;
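As a quick sanity check, the grid from -5 to 30 in steps of 0.5 should contain (30 - (-5))/0.5 + 1 = 71 values, which is the total that appears in the test-data classification summaries later in this example. The following is a minimal sketch of one way to confirm the count; the variable name n and the message text are illustrative only.

```sas
data _null_;
   /* NOBS= is filled in at compile time, so no rows need to be read */
   if 0 then set plotdata nobs=n;
   put 'Observations in plotdata: ' n;
   stop;
run;
```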
The same plots are produced after each discriminant analysis,
so a macro can be used to reduce the amount of typing required.
The macro PLOT uses two data sets.
The data set plotd, containing density estimates,
is created by the TESTOUTD= option in PROC DISCRIM.
The data set plotp, containing posterior
probabilities, is created by the TESTOUT= option.
For each data set, the macro PLOT sets near-zero values to
missing so that they are not plotted, and then draws an overlay
plot showing all three species on a single plot.
The following statements create the macro PLOT:
%macro plot;
   /* Set near-zero density estimates to missing so they are not plotted */
   data plotd;
      set plotd;
      if setosa    <.002 then setosa=.;
      if versicolor<.002 then versicolor=.;
      if virginica <.002 then virginica=.;
      label PetalWidth='Petal Width in mm.';
   run;
   symbol1 i=join v=none c=red    l=1 /*l=21*/;
   symbol2 i=join v=none c=yellow l=1 /*l= 1*/;
   symbol3 i=join v=none c=blue   l=1 /*l= 2*/;
   legend1 label=none frame;
   axis1 label=(angle=90 'Density') order=(0 to .6 by .1);
   /* Overlay the three class density curves on one plot */
   proc gplot data=plotd;
      plot setosa*PetalWidth
           versicolor*PetalWidth
           virginica*PetalWidth
           / overlay vaxis=axis1 legend=legend1 frame
             cframe=ligr;
      title3 'Plot of Estimated Densities';
   run;
   /* Set near-zero posterior probabilities to missing */
   data plotp;
      set plotp;
      if setosa    <.01 then setosa=.;
      if versicolor<.01 then versicolor=.;
      if virginica <.01 then virginica=.;
      label PetalWidth='Petal Width in mm.';
   run;
   axis1 label=(angle=90 'Posterior Probability')
         order=(0 to 1 by .2);
   /* Overlay the three posterior probability curves on one plot */
   proc gplot data=plotp;
      plot setosa*PetalWidth
           versicolor*PetalWidth
           virginica*PetalWidth
           / overlay vaxis=axis1 legend=legend1 frame
             cframe=ligr;
      title3 'Plot of Posterior Probabilities';
   run;
%mend;
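The three IF statements in each DATA step of the macro apply the same threshold to each species variable. As an alternative sketch (not part of the original macro), the same step could be written with an array; the array name dens and the index variable i are hypothetical, and the step assumes that plotd already exists with the variables setosa, versicolor, and virginica.

```sas
data plotd;
   set plotd;
   /* Group the three density variables so one loop can handle them */
   array dens{3} setosa versicolor virginica;
   do i = 1 to 3;
      /* Values below the threshold become missing and are not plotted */
      if dens{i} < .002 then dens{i} = .;
   end;
   drop i;
   label PetalWidth='Petal Width in mm.';
run;
```

The behavior should match the original step; the array form mainly helps if the number of classes grows.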
The first analysis uses normal-theory methods (METHOD=NORMAL)
assuming equal variances (POOL=YES) in the three classes.
The NOCLASSIFY option suppresses the resubstitution
classification results of the input data set observations.
The CROSSLISTERR option lists the observations
that are misclassified under cross validation
and displays cross validation error-rate estimates.
The following statements produce Output 25.1.2:
proc discrim data=iris method=normal pool=yes
             testdata=plotdata testout=plotp testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Normal Density Estimates with Equal Variance';
run;
%plot
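For reference, the two quantities that %plot displays in this case can be written out for a single variable. The following is a sketch using the standard normal-theory and Bayes formulas; the symbols m_t (class mean of petal width), s^2 (pooled within-class variance), and q_t (prior probability of class t) are notation assumed here and do not appear in the statements above.

```latex
f_t(x) = \frac{1}{\sqrt{2\pi s^2}}
         \exp\!\left( -\frac{(x - m_t)^2}{2 s^2} \right),
\qquad
p(t \mid x) = \frac{q_t \, f_t(x)}{\sum_u q_u \, f_u(x)}
```

Because the priors here are all 1/3, they cancel in the posterior, so each posterior probability is the class density divided by the sum of the three class densities.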
Output 25.1.2: Normal Density Estimates with Equal Variance
Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Equal Variance

Observations   150     DF Total              149
Variables        1     DF Within Classes     147
Classes          3     DF Between Classes      2

Class Level Information

Species      Variable Name   Frequency    Weight   Proportion   Prior Probability
Setosa       Setosa                 50   50.0000     0.333333            0.333333
Versicolor   Versicolor             50   50.0000     0.333333            0.333333
Virginica    Virginica              50   50.0000     0.333333            0.333333

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Linear Discriminant Function

Posterior Probability of Membership in Species

Obs   From Species   Classified into Species   Setosa   Versicolor   Virginica
  5   Virginica      Versicolor *              0.0000       0.9610      0.0390
  9   Versicolor     Virginica  *              0.0000       0.0952      0.9048
 57   Virginica      Versicolor *              0.0000       0.9940      0.0060
 78   Virginica      Versicolor *              0.0000       0.8009      0.1991
 91   Virginica      Versicolor *              0.0000       0.9610      0.0390
148   Versicolor     Virginica  *              0.0000       0.3828      0.6172

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Linear Discriminant Function

Number of Observations and Percent Classified into Species

From Species       Setosa   Versicolor    Virginica        Total
Setosa                 50            0            0           50
                   100.00         0.00         0.00       100.00
Versicolor              0           48            2           50
                     0.00        96.00         4.00       100.00
Virginica               0            4           46           50
                     0.00         8.00        92.00       100.00
Total                  50           52           48          150
                    33.33        34.67        32.00       100.00
Priors            0.33333      0.33333      0.33333

Error Count Estimates for Species

                   Setosa   Versicolor    Virginica        Total
Rate               0.0000       0.0400       0.0800       0.0400
Priors             0.3333       0.3333       0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Linear Discriminant Function

Number of Observations and Percent Classified into Species

                   Setosa   Versicolor    Virginica        Total
Total                  26           18           27           71
                    36.62        25.35        38.03       100.00
Priors            0.33333      0.33333      0.33333

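The Total column of the Error Count Estimates table in Output 25.1.2 can be checked by hand. Each class rate is the fraction of that class misclassified under cross validation (2 of 50 for Versicolor, 4 of 50 for Virginica), and the total is the prior-weighted average of the class rates. The symbols e_t (class error rate) and q_t (prior) are notation assumed for this sketch.

```latex
e_{\text{total}} = \sum_t q_t \, e_t
                 = \tfrac{1}{3}\,(0.0000 + 0.0400 + 0.0800)
                 = 0.0400
```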
The next analysis uses normal-theory methods assuming
unequal variances (POOL=NO) in the three classes.
The following statements produce Output 25.1.3:
proc discrim data=iris method=normal pool=no
             testdata=plotdata testout=plotp testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Normal Density Estimates with Unequal Variance';
run;
%plot
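With POOL=NO, each class keeps its own variance, so the classification rule becomes quadratic in petal width. A sketch of the single-variable case with equal priors follows; m_t and s_t^2 (the mean and variance of petal width within class t) are notation assumed here, and the prior term is omitted because it is the same constant for every class.

```latex
D_t^2(x) = \frac{(x - m_t)^2}{s_t^2} + \ln s_t^2,
\qquad
p(t \mid x) = \frac{\exp\!\left(-\tfrac{1}{2} D_t^2(x)\right)}
                   {\sum_u \exp\!\left(-\tfrac{1}{2} D_u^2(x)\right)}
```

The extra log-variance term penalizes classes with large variance near their own mean and favors them far from it, which is why the posterior curves in this analysis can differ noticeably from the equal-variance case.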
Output 25.1.3: Normal Density Estimates with Unequal Variance
Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Unequal Variance

Observations   150     DF Total              149
Variables        1     DF Within Classes     147
Classes          3     DF Between Classes      2

Class Level Information

Species      Variable Name   Frequency    Weight   Proportion   Prior Probability
Setosa       Setosa                 50   50.0000     0.333333            0.333333
Versicolor   Versicolor             50   50.0000     0.333333            0.333333
Virginica    Virginica              50   50.0000     0.333333            0.333333

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Quadratic Discriminant Function

Posterior Probability of Membership in Species

Obs   From Species   Classified into Species   Setosa   Versicolor   Virginica
  5   Virginica      Versicolor *              0.0000       0.8740      0.1260
  9   Versicolor     Virginica  *              0.0000       0.0686      0.9314
 42   Setosa         Versicolor *              0.4923       0.5073      0.0004
 57   Virginica      Versicolor *              0.0000       0.9602      0.0398
 78   Virginica      Versicolor *              0.0000       0.6558      0.3442
 91   Virginica      Versicolor *              0.0000       0.8740      0.1260
148   Versicolor     Virginica  *              0.0000       0.2871      0.7129

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species

From Species       Setosa   Versicolor    Virginica        Total
Setosa                 49            1            0           50
                    98.00         2.00         0.00       100.00
Versicolor              0           48            2           50
                     0.00        96.00         4.00       100.00
Virginica               0            4           46           50
                     0.00         8.00        92.00       100.00
Total                  49           53           48          150
                    32.67        35.33        32.00       100.00
Priors            0.33333      0.33333      0.33333

Error Count Estimates for Species

                   Setosa   Versicolor    Virginica        Total
Rate               0.0200       0.0400       0.0800       0.0467
Priors             0.3333       0.3333       0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species

                   Setosa   Versicolor    Virginica        Total
Total                  23           20           28           71
                    32.39        28.17        39.44       100.00
Priors            0.33333      0.33333      0.33333

Two more analyses are run with nonparametric methods (METHOD=NPAR),
specifically kernel density estimates with normal kernels
(KERNEL=NORMAL). The first of these uses equal bandwidths (smoothing
parameters) in each class (POOL=YES). The use of equal bandwidths
does not constrain the density estimates to be of equal variance. The
value of the radius parameter that, assuming normality, minimizes an
approximate mean integrated square error is 0.48 (see
the "Nonparametric Methods" section). Choosing
r=0.4 gives a more detailed look at the irregularities in the data.
The following statements produce
Output 25.1.4:
proc discrim data=iris method=npar kernel=normal
             r=.4 pool=yes
             testdata=plotdata testout=plotp
             testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Kernel Density Estimates with Equal Bandwidth';
run;
%plot
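The radius value 0.48 mentioned above can be reproduced with a short calculation. The sketch below assumes the approximation r = (A(K_t)/n_t)**(1/(p+4)) with A(K_t) = 4/(2p+1) for a normal kernel, as given in the "Nonparametric Methods" section; the variable names p, n_t, a, and r are illustrative only.

```sas
data _null_;
   p   = 1;                        /* number of variables (PetalWidth only) */
   n_t = 50;                       /* observations in each class            */
   a   = 4 / (2*p + 1);            /* assumed normal-kernel constant A(K_t) */
   r   = (a / n_t) ** (1 / (p + 4));
   put 'Approximate optimal radius: ' r 6.4;
run;
```

Under these assumptions the result is about 0.484, which rounds to the 0.48 quoted in the text. The analysis uses the slightly smaller r=.4, so the density estimates are a little less smooth than this value would give.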
Output 25.1.4: Kernel Density Estimates with Equal Bandwidth
Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Equal Bandwidth

Observations   150     DF Total              149
Variables        1     DF Within Classes     147
Classes          3     DF Between Classes      2

Class Level Information

Species      Variable Name   Frequency    Weight   Proportion   Prior Probability
Setosa       Setosa                 50   50.0000     0.333333            0.333333
Versicolor   Versicolor             50   50.0000     0.333333            0.333333
Virginica    Virginica              50   50.0000     0.333333            0.333333

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Normal Kernel Density

Posterior Probability of Membership in Species

Obs   From Species   Classified into Species   Setosa   Versicolor   Virginica
  5   Virginica      Versicolor *              0.0000       0.8827      0.1173
  9   Versicolor     Virginica  *              0.0000       0.0438      0.9562
 57   Virginica      Versicolor *              0.0000       0.9472      0.0528
 78   Virginica      Versicolor *              0.0000       0.8061      0.1939
 91   Virginica      Versicolor *              0.0000       0.8827      0.1173
148   Versicolor     Virginica  *              0.0000       0.2586      0.7414

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species

From Species       Setosa   Versicolor    Virginica        Total
Setosa                 50            0            0           50
                   100.00         0.00         0.00       100.00
Versicolor              0           48            2           50
                     0.00        96.00         4.00       100.00
Virginica               0            4           46           50
                     0.00         8.00        92.00       100.00
Total                  50           52           48          150
                    33.33        34.67        32.00       100.00
Priors            0.33333      0.33333      0.33333

Error Count Estimates for Species

                   Setosa   Versicolor    Virginica        Total
Rate               0.0000       0.0400       0.0800       0.0400
Priors             0.3333       0.3333       0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species

                   Setosa   Versicolor    Virginica        Total
Total                  26           18           27           71
                    36.62        25.35        38.03       100.00
Priors            0.33333      0.33333      0.33333

Another nonparametric analysis is run
with unequal bandwidths (POOL=NO).
These statements produce Output 25.1.5:
proc discrim data=iris method=npar kernel=normal
             r=.4 pool=no
             testdata=plotdata testout=plotp
             testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Kernel Density Estimates with Unequal Bandwidth';
run;
%plot
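To see why POOL=NO corresponds to unequal bandwidths, it helps to write the normal-kernel estimate for one variable. In the sketch below, n_t is the number of training observations in class t, the sum runs over those observations y, and s_t is the within-class standard deviation of petal width; this notation is assumed here rather than taken from the statements above.

```latex
\hat{f}_t(x) = \frac{1}{n_t} \sum_{y \in t} K_t(x - y),
\qquad
K_t(z) = \frac{1}{\sqrt{2\pi}\; r\, s_t}
         \exp\!\left( -\frac{z^2}{2\, r^2 s_t^2} \right)
```

The effective bandwidth is the product r s_t, so the same r=.4 yields a narrower kernel for a class with small spread and a wider one for a class with large spread. With POOL=YES, s_t is replaced by the pooled within-class standard deviation, which gives every class the same bandwidth.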
Output 25.1.5: Kernel Density Estimates with Unequal Bandwidth
Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Unequal Bandwidth

Observations   150     DF Total              149
Variables        1     DF Within Classes     147
Classes          3     DF Between Classes      2

Class Level Information

Species      Variable Name   Frequency    Weight   Proportion   Prior Probability
Setosa       Setosa                 50   50.0000     0.333333            0.333333
Versicolor   Versicolor             50   50.0000     0.333333            0.333333
Virginica    Virginica              50   50.0000     0.333333            0.333333

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Normal Kernel Density

Posterior Probability of Membership in Species

Obs   From Species   Classified into Species   Setosa   Versicolor   Virginica
  5   Virginica      Versicolor *              0.0000       0.8805      0.1195
  9   Versicolor     Virginica  *              0.0000       0.0466      0.9534
 57   Virginica      Versicolor *              0.0000       0.9394      0.0606
 78   Virginica      Versicolor *              0.0000       0.7193      0.2807
 91   Virginica      Versicolor *              0.0000       0.8805      0.1195
148   Versicolor     Virginica  *              0.0000       0.2275      0.7725

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species

From Species       Setosa   Versicolor    Virginica        Total
Setosa                 50            0            0           50
                   100.00         0.00         0.00       100.00
Versicolor              0           48            2           50
                     0.00        96.00         4.00       100.00
Virginica               0            4           46           50
                     0.00         8.00        92.00       100.00
Total                  50           52           48          150
                    33.33        34.67        32.00       100.00
Priors            0.33333      0.33333      0.33333

Error Count Estimates for Species

                   Setosa   Versicolor    Virginica        Total
Rate               0.0000       0.0400       0.0800       0.0400
Priors             0.3333       0.3333       0.3333

Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species

                   Setosa   Versicolor    Virginica        Total
Total                  25           18           28           71
                    35.21        25.35        39.44       100.00
Priors            0.33333      0.33333      0.33333

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.