The DISCRIM Procedure

Example 25.1: Univariate Density Estimates and Posterior Probabilities

In this example, several discriminant analyses are run with a single quantitative variable, petal width, so that density estimates and posterior probabilities can be plotted easily. The example produces Output 25.1.1 through Output 25.1.5. The GCHART procedure displays the sample distribution of petal width in the three species. Note the overlap between I. versicolor and I. virginica in the bar chart. The following statements produce Output 25.1.1:

   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;

   data iris;
      title 'Discriminant Analysis of Fisher (1936) Iris Data';
      input SepalLength SepalWidth PetalLength PetalWidth  
            Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      symbol = put(Species, specname10.);
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
   ;

   pattern1 c=red    /*v=l1   */;
   pattern2 c=yellow /*v=empty*/;
   pattern3 c=blue   /*v=r1   */;
   axis1 label=(angle=90);
   axis2 value=(height=.6);
   legend1 frame label=none;

   proc gchart data=iris;
      vbar PetalWidth / subgroup=Species midpoints=0 to 25
           raxis=axis1 maxis=axis2 legend=legend1 cframe=ligr;
   run;

Output 25.1.1: Sample Distribution of Petal Width in Three Species
[Figure: bar chart of PetalWidth, subgrouped by Species]

In order to plot the density estimates and posterior probabilities, a data set called plotdata is created containing equally spaced values from -5 to 30, covering the range of petal width with a little to spare on each end. The plotdata data set is used with the TESTDATA= option in PROC DISCRIM.

   data plotdata;
      do PetalWidth=-5 to 30 by .5;
         output;
      end;
   run;
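
As a quick check, a grid from -5 to 30 in steps of 0.5 contains (30 - (-5))/0.5 + 1 = 71 points, which is the total observation count reported for WORK.PLOTDATA in the test-data classification summaries later in this example. The following sketch, which is not part of the original example, uses PROC MEANS to confirm the count and the range:

   proc means data=plotdata n min max;
      var PetalWidth;
   run;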

The same plots are produced after each discriminant analysis, so a macro can be used to reduce the amount of typing required. The macro PLOT uses two data sets. The data set plotd, containing density estimates, is created by the TESTOUTD= option in PROC DISCRIM. The data set plotp, containing posterior probabilities, is created by the TESTOUT= option. For each data set, the macro PLOT sets uninteresting values (those near zero) to missing so that they are not plotted, and then produces an overlay plot showing all three species on a single plot. The following statements create the macro PLOT:

   %macro plot;
      data plotd;
         set plotd;
         if setosa<.002 then setosa=.;
         if versicolor<.002 then versicolor=.;
         if virginica <.002 then virginica=.;
         label PetalWidth='Petal Width in mm.';
      run;

      symbol1 i=join v=none c=red    l=1 /*l=21*/;
      symbol2 i=join v=none c=yellow l=1 /*l= 1*/;
      symbol3 i=join v=none c=blue   l=1 /*l= 2*/;
      legend1 label=none frame;
      axis1 label=(angle=90 'Density') order=(0 to .6 by .1);
      
      proc gplot data=plotd;
         plot setosa*PetalWidth
              versicolor*PetalWidth
              virginica*PetalWidth
              / overlay vaxis=axis1 legend=legend1 frame 
                cframe=ligr;
         title3 'Plot of Estimated Densities';
      run;

      data plotp;
         set plotp;
         if setosa<.01 then setosa=.;
         if versicolor<.01 then versicolor=.;
         if virginica<.01 then virginica=.;
         label PetalWidth='Petal Width in mm.';
      run;

      axis1 label=(angle=90 'Posterior Probability') 
            order=(0 to 1 by .2);
      
      proc gplot data=plotp;
         plot setosa*PetalWidth
              versicolor*PetalWidth
              virginica*PetalWidth
              / overlay vaxis=axis1 legend=legend1 frame 
                cframe=ligr;
         title3 'Plot of Posterior Probabilities';
      run;
   %mend;

The first analysis uses normal-theory methods (METHOD=NORMAL) assuming equal variances (POOL=YES) in the three classes. The NOCLASSIFY option suppresses the resubstitution classification results of the input data set observations. The CROSSLISTERR option lists the observations that are misclassified under cross validation and displays cross validation error-rate estimates. The following statements produce Output 25.1.2:

   proc discrim data=iris method=normal pool=yes
                testdata=plotdata testout=plotp testoutd=plotd 
                short noclassify crosslisterr;
      class Species;
      var PetalWidth;
      title2 'Using Normal Density Estimates with Equal Variance';
   run;
   %plot
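
The posterior probabilities in plotp are related to the density estimates in plotd by Bayes' theorem: the posterior probability for a class is its prior probability times its density, divided by the sum of those products over all classes. With the equal priors used here, the priors cancel. The following sketch illustrates the relationship; it is not part of the original example, and the data set names dens and postcheck and the variables total, p_setosa, p_versicolor, and p_virginica are hypothetical. A separate TESTOUTD= data set is requested because the PLOT macro replaces small values in plotd with missing values.

   proc discrim data=iris method=normal pool=yes
                testdata=plotdata testoutd=dens
                short noclassify;
      class Species;
      var PetalWidth;
   run;

   data postcheck;
      set dens;
      /* with equal priors, the posterior probability is the class */
      /* density divided by the sum of the three class densities   */
      total = sum(Setosa, Versicolor, Virginica);
      if total > 0 then do;
         p_setosa     = Setosa     / total;
         p_versicolor = Versicolor / total;
         p_virginica  = Virginica  / total;
      end;
   run;

The computed values should agree, up to rounding, with the posterior probabilities that the TESTOUT= option writes for the same analysis.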

Output 25.1.2: Normal Density Estimates with Equal Variance

Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Equal Variance

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         1    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

             Variable                                            Prior
Species      Name         Frequency    Weight  Proportion  Probability
Setosa       Setosa              50   50.0000    0.333333     0.333333
Versicolor   Versicolor          50   50.0000    0.333333     0.333333
Virginica    Virginica           50   50.0000    0.333333     0.333333


Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Equal Variance

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Linear Discriminant Function

Posterior Probability of Membership in Species

                     Classified
 Obs  From Species   into Species        Setosa  Versicolor   Virginica
   5  Virginica      Versicolor   *      0.0000      0.9610      0.0390
   9  Versicolor     Virginica    *      0.0000      0.0952      0.9048
  57  Virginica      Versicolor   *      0.0000      0.9940      0.0060
  78  Virginica      Versicolor   *      0.0000      0.8009      0.1991
  91  Virginica      Versicolor   *      0.0000      0.9610      0.0390
 148  Versicolor     Virginica    *      0.0000      0.3828      0.6172

* Misclassified observation


Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Equal Variance

The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Linear Discriminant Function

Number of Observations and Percent Classified into Species

From Species        Setosa  Versicolor   Virginica       Total
Setosa                  50           0           0          50
                    100.00        0.00        0.00      100.00
Versicolor               0          48           2          50
                      0.00       96.00        4.00      100.00
Virginica                0           4          46          50
                      0.00        8.00       92.00      100.00
Total                   50          52          48         150
                     33.33       34.67       32.00      100.00
Priors             0.33333     0.33333     0.33333

Error Count Estimates for Species

                    Setosa  Versicolor   Virginica       Total
Rate                0.0000      0.0400      0.0800      0.0400
Priors              0.3333      0.3333      0.3333
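
The Total error rate is the average of the class-specific rates weighted by the prior probabilities. With equal priors of 1/3, this gives (0.0000 + 0.0400 + 0.0800)/3 = 0.0400. A minimal DATA step sketch of the same arithmetic follows; the array and variable names are illustrative and are not part of the original example.

   data _null_;
      /* cross validation error rates: Setosa, Versicolor, Virginica */
      array rate[3] _temporary_ (0 0.04 0.08);
      total = 0;
      do t = 1 to 3;
         total = total + rate[t] / 3;   /* each prior is 1/3 */
      end;
      put 'Total error rate: ' total 6.4;
   run;

Substituting the rates 0.02, 0.04, and 0.08 reported for the unequal-variance analysis gives 0.14/3, which rounds to the 0.0467 shown in Output 25.1.3.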


Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Equal Variance

The DISCRIM Procedure
Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Linear Discriminant Function

Number of Observations and Percent Classified into Species

                    Setosa  Versicolor   Virginica       Total
Total                   26          18          27          71
                     36.62       25.35       38.03      100.00
Priors             0.33333     0.33333     0.33333


[Figure: Plot of Estimated Densities]

[Figure: Plot of Posterior Probabilities]

The next analysis uses normal-theory methods assuming unequal variances (POOL=NO) in the three classes. The following statements produce Output 25.1.3:

   proc discrim data=iris method=normal pool=no
                testdata=plotdata testout=plotp testoutd=plotd 
                short noclassify crosslisterr;
      class Species;
      var PetalWidth;
      title2 'Using Normal Density Estimates with Unequal Variance';
   run;
   %plot

Output 25.1.3: Normal Density Estimates with Unequal Variance

Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Unequal Variance

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         1    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

             Variable                                            Prior
Species      Name         Frequency    Weight  Proportion  Probability
Setosa       Setosa              50   50.0000    0.333333     0.333333
Versicolor   Versicolor          50   50.0000    0.333333     0.333333
Virginica    Virginica           50   50.0000    0.333333     0.333333


Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Unequal Variance

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Quadratic Discriminant Function

Posterior Probability of Membership in Species

                     Classified
 Obs  From Species   into Species        Setosa  Versicolor   Virginica
   5  Virginica      Versicolor   *      0.0000      0.8740      0.1260
   9  Versicolor     Virginica    *      0.0000      0.0686      0.9314
  42  Setosa         Versicolor   *      0.4923      0.5073      0.0004
  57  Virginica      Versicolor   *      0.0000      0.9602      0.0398
  78  Virginica      Versicolor   *      0.0000      0.6558      0.3442
  91  Virginica      Versicolor   *      0.0000      0.8740      0.1260
 148  Versicolor     Virginica    *      0.0000      0.2871      0.7129

* Misclassified observation


Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Unequal Variance

The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species

From Species        Setosa  Versicolor   Virginica       Total
Setosa                  49           1           0          50
                     98.00        2.00        0.00      100.00
Versicolor               0          48           2          50
                      0.00       96.00        4.00      100.00
Virginica                0           4          46          50
                      0.00        8.00       92.00      100.00
Total                   49          53          48         150
                     32.67       35.33       32.00      100.00
Priors             0.33333     0.33333     0.33333

Error Count Estimates for Species

                    Setosa  Versicolor   Virginica       Total
Rate                0.0200      0.0400      0.0800      0.0467
Priors              0.3333      0.3333      0.3333


Discriminant Analysis of Fisher (1936) Iris Data
Using Normal Density Estimates with Unequal Variance

The DISCRIM Procedure
Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species

                    Setosa  Versicolor   Virginica       Total
Total                   23          20          28          71
                     32.39       28.17       39.44      100.00
Priors             0.33333     0.33333     0.33333


[Figure: Plot of Estimated Densities]

[Figure: Plot of Posterior Probabilities]

Two more analyses are run with nonparametric methods (METHOD=NPAR), specifically kernel density estimates with normal kernels (KERNEL=NORMAL). The first of these uses equal bandwidths, or smoothing parameters, in each class (POOL=YES). Using equal bandwidths does not constrain the density estimates to have equal variance. The value of the radius parameter r that, assuming normality, minimizes an approximate mean integrated square error is 0.48 (see the "Nonparametric Methods" section). Choosing the smaller value r=0.4 gives a more detailed look at the irregularities in the data. The following statements produce Output 25.1.4:

   proc discrim data=iris method=npar kernel=normal 
                   r=.4 pool=yes 
                testdata=plotdata testout=plotp 
                   testoutd=plotd  
                short noclassify crosslisterr;
      class Species;
      var PetalWidth;
      title2 'Using Kernel Density Estimates with Equal Bandwidth';
   run;
   %plot
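
The radius value of 0.48 quoted for this analysis can be reproduced approximately. The sketch below assumes the normal-kernel formula r = (A/n)**(1/(p+4)) with A = 4/(2p+1), where p is the number of variables and n is the class sample size. That formula is not restated in this example, so treat it as an assumption and verify it against the "Nonparametric Methods" section. The variable names are illustrative.

   data _null_;
      p = 1;                          /* one variable: PetalWidth       */
      n = 50;                         /* observations in each species   */
      a = 4 / (2*p + 1);              /* assumed normal-kernel constant */
      r = (a / n) ** (1 / (p + 4));
      put 'Approximate optimal r: ' r 6.2;
   run;

With p=1 and n=50 the expression evaluates to about 0.48, so the r=0.4 used in the example smooths slightly less than this reference value.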

Output 25.1.4: Kernel Density Estimates with Equal Bandwidth

Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Equal Bandwidth

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         1    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

             Variable                                            Prior
Species      Name         Frequency    Weight  Proportion  Probability
Setosa       Setosa              50   50.0000    0.333333     0.333333
Versicolor   Versicolor          50   50.0000    0.333333     0.333333
Virginica    Virginica           50   50.0000    0.333333     0.333333


Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Equal Bandwidth

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Normal Kernel Density

Posterior Probability of Membership in Species

                     Classified
 Obs  From Species   into Species        Setosa  Versicolor   Virginica
   5  Virginica      Versicolor   *      0.0000      0.8827      0.1173
   9  Versicolor     Virginica    *      0.0000      0.0438      0.9562
  57  Virginica      Versicolor   *      0.0000      0.9472      0.0528
  78  Virginica      Versicolor   *      0.0000      0.8061      0.1939
  91  Virginica      Versicolor   *      0.0000      0.8827      0.1173
 148  Versicolor     Virginica    *      0.0000      0.2586      0.7414

* Misclassified observation


Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Equal Bandwidth

The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species

From Species        Setosa  Versicolor   Virginica       Total
Setosa                  50           0           0          50
                    100.00        0.00        0.00      100.00
Versicolor               0          48           2          50
                      0.00       96.00        4.00      100.00
Virginica                0           4          46          50
                      0.00        8.00       92.00      100.00
Total                   50          52          48         150
                     33.33       34.67       32.00      100.00
Priors             0.33333     0.33333     0.33333

Error Count Estimates for Species

                    Setosa  Versicolor   Virginica       Total
Rate                0.0000      0.0400      0.0800      0.0400
Priors              0.3333      0.3333      0.3333


Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Equal Bandwidth

The DISCRIM Procedure
Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species

                    Setosa  Versicolor   Virginica       Total
Total                   26          18          27          71
                     36.62       25.35       38.03      100.00
Priors             0.33333     0.33333     0.33333


[Figure: Plot of Estimated Densities]

[Figure: Plot of Posterior Probabilities]

Another nonparametric analysis is run with unequal bandwidths (POOL=NO). These statements produce Output 25.1.5:

   proc discrim data=iris method=npar kernel=normal 
                   r=.4 pool=no 
                testdata=plotdata testout=plotp 
                   testoutd=plotd
                short noclassify crosslisterr;
      class Species;
      var PetalWidth;
      title2 'Using Kernel Density Estimates with Unequal Bandwidth';
   run;
   %plot

Output 25.1.5: Kernel Density Estimates with Unequal Bandwidth

Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Unequal Bandwidth

The DISCRIM Procedure

Observations    150    DF Total              149
Variables         1    DF Within Classes     147
Classes           3    DF Between Classes      2

Class Level Information

             Variable                                            Prior
Species      Name         Frequency    Weight  Proportion  Probability
Setosa       Setosa              50   50.0000    0.333333     0.333333
Versicolor   Versicolor          50   50.0000    0.333333     0.333333
Virginica    Virginica           50   50.0000    0.333333     0.333333


Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Unequal Bandwidth

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Normal Kernel Density

Posterior Probability of Membership in Species

                     Classified
 Obs  From Species   into Species        Setosa  Versicolor   Virginica
   5  Virginica      Versicolor   *      0.0000      0.8805      0.1195
   9  Versicolor     Virginica    *      0.0000      0.0466      0.9534
  57  Virginica      Versicolor   *      0.0000      0.9394      0.0606
  78  Virginica      Versicolor   *      0.0000      0.7193      0.2807
  91  Virginica      Versicolor   *      0.0000      0.8805      0.1195
 148  Versicolor     Virginica    *      0.0000      0.2275      0.7725

* Misclassified observation


Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Unequal Bandwidth

The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species

From Species        Setosa  Versicolor   Virginica       Total
Setosa                  50           0           0          50
                    100.00        0.00        0.00      100.00
Versicolor               0          48           2          50
                      0.00       96.00        4.00      100.00
Virginica                0           4          46          50
                      0.00        8.00       92.00      100.00
Total                   50          52          48         150
                     33.33       34.67       32.00      100.00
Priors             0.33333     0.33333     0.33333

Error Count Estimates for Species

                    Setosa  Versicolor   Virginica       Total
Rate                0.0000      0.0400      0.0800      0.0400
Priors              0.3333      0.3333      0.3333


Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Unequal Bandwidth

The DISCRIM Procedure
Classification Summary for Test Data: WORK.PLOTDATA
Classification Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species

                    Setosa  Versicolor   Virginica       Total
Total                   25          18          28          71
                     35.21       25.35       39.44      100.00
Priors             0.33333     0.33333     0.33333


[Figure: Plot of Estimated Densities]

[Figure: Plot of Posterior Probabilities]
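
The four PROC DISCRIM steps in this example differ only in the METHOD= and POOL= values, the kernel options, and the title. As a sketch of how the repetition could be reduced further, the steps can be wrapped in a parameterized macro. The macro name RUNDISCRIM and its parameters are hypothetical and do not appear in the original example; the macro assumes that the iris and plotdata data sets and the PLOT macro have already been created as shown earlier.

   %macro rundiscrim(method=normal, pool=yes, extra=, subtitle=);
      /* extra= carries additional PROC DISCRIM options, if any */
      proc discrim data=iris method=&method pool=&pool
                   %unquote(&extra)
                   testdata=plotdata testout=plotp testoutd=plotd
                   short noclassify crosslisterr;
         class Species;
         var PetalWidth;
         title2 "&subtitle";
      run;
      %plot
   %mend rundiscrim;

   %rundiscrim(method=normal, pool=no,
               subtitle=Using Normal Density Estimates with Unequal Variance)
   %rundiscrim(method=npar, pool=yes, extra=%str(kernel=normal r=.4),
               subtitle=Using Kernel Density Estimates with Equal Bandwidth)

The %STR function masks the equal signs in the extra= value so that they are not parsed as part of the macro call, and %UNQUOTE restores them before PROC DISCRIM reads the options.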


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.