Getting Started
The data in this example are measurements taken on 159 fish caught off the coast of Finland.
The species, weight, three different length measurements, height, and
width of each fish are recorded. The full data set is displayed in
Chapter 60, "The STEPDISC Procedure."
The STEPDISC procedure identifies all the variables as
significant indicators of the differences among the seven
fish species. The goal now is to find a discriminant
function based on these six variables that best classifies
the fish into species.
First, assume that the data are normally distributed within each group
with equal covariances across groups.
The following program uses PROC DISCRIM to analyze
the Fish data and create Figure 25.1 through Figure 25.5.
/* Map the numeric species codes to species names. */
proc format;
   value specfmt
      1='Bream'
      2='Roach'
      3='Whitefish'
      4='Parkki'
      5='Perch'
      6='Pike'
      7='Smelt';
run;

/* Read the measurements. HtPct and WidthPct are percentages of Length3 */
/* and are converted to absolute Height and Width before being dropped. */
data fish (drop=HtPct WidthPct);
   title 'Fish Measurement Data';
   input Species Weight Length1 Length2 Length3 HtPct
         WidthPct @@;
   Height=HtPct*Length3/100;
   Width=WidthPct*Length3/100;
   format Species specfmt.;
   symbol = put(Species, specfmt.);
   datalines;
1 242.0 23.2 25.4 30.0 38.4 13.4
1 290.0 24.0 26.3 31.2 40.0 13.8
1 340.0 23.9 26.5 31.1 39.8 15.1
1 363.0 26.3 29.0 33.5 38.0 13.3
...[155 more records]
;

/* Discriminant analysis of Species with the default settings. */
proc discrim data=fish;
   class Species;
run;
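The DISCRIM procedure begins by displaying summary information
about the variables in the analysis (Figure 25.1).
This information includes the number of observations, the
number of quantitative variables in the analysis (specified
with the VAR statement), and the number of classes in the
classification variable (specified with the CLASS statement).
Because this program has no VAR statement, all numeric variables
not listed in other statements are analyzed: Weight, Length1,
Length2, Length3, Height, and Width.
Figure 25.1 reports 158 observations rather than 159; PROC DISCRIM
excludes observations with missing values for the analysis variables
when it develops the classification criterion.
The frequency of each class, its weight, proportion of
the total sample, and prior probability
are also displayed. Equal priors are assigned by default.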
Observations    158    DF Total              157
Variables         6    DF Within Classes     151
Classes           7    DF Between Classes      6

Class Level Information

Species    Variable Name  Frequency   Weight  Proportion  Prior Probability
Bream      Bream                 34  34.0000    0.215190           0.142857
Parkki     Parkki                11  11.0000    0.069620           0.142857
Perch      Perch                 56  56.0000    0.354430           0.142857
Pike       Pike                  17  17.0000    0.107595           0.142857
Roach      Roach                 20  20.0000    0.126582           0.142857
Smelt      Smelt                 14  14.0000    0.088608           0.142857
Whitefish  Whitefish              6   6.0000    0.037975           0.142857

Figure 25.1: Summary Information
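
The defaults described above can also be written out explicitly. The following is a minimal sketch, assuming the fish data set created earlier; it should request the same analysis as the shorter program, with the VAR statement naming the six analysis variables and the PRIORS statement stating the equal priors.

```sas
/* Same analysis with the defaults spelled out (a sketch, not required). */
proc discrim data=fish method=normal pool=yes;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
   priors equal;   /* replace with "priors proportional;" to use the class proportions */
run;
```

With PRIORS PROPORTIONAL, the prior probabilities are set to the sample proportions of each class instead of 1/7 each.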
The natural log of the determinant of the pooled covariance matrix is
displayed next (Figure 25.2).
The generalized squared distances between the classes are shown in
Figure 25.3.
Pooled Covariance Matrix Information

Covariance Matrix Rank    Natural Log of the Determinant of the Covariance Matrix
                     6                                                    4.17613

Figure 25.2: Pooled Covariance Matrix Information
Generalized Squared Distance to Species

From Species      Bream     Parkki      Perch       Pike      Roach      Smelt  Whitefish
Bream                 0   83.32523  243.66688  310.52333  133.06721  252.75503  132.05820
Parkki         83.32523          0   57.09760  174.20918   27.00096   60.52076   26.54855
Perch         243.66688   57.09760          0  101.06791   29.21632   29.26806   20.43791
Pike          310.52333  174.20918  101.06791          0   92.40876  127.82177   99.90673
Roach         133.06721   27.00096   29.21632   92.40876          0   33.84280    6.31997
Smelt         252.75503   60.52076   29.26806  127.82177   33.84280          0   46.37326
Whitefish     132.05820   26.54855   20.43791   99.90673    6.31997   46.37326          0

Figure 25.3: Squared Distances
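
For orientation, with a pooled covariance matrix and equal priors, the generalized squared distance from class i to class j reduces to the squared Mahalanobis distance between the class means. The notation below is an assumption introduced here, not taken from the output: \bar{x}_i is the mean vector of class i, and S_p is the pooled within-class covariance matrix.

```latex
D^2(i \mid j) = (\bar{x}_i - \bar{x}_j)^{\top} S_p^{-1} (\bar{x}_i - \bar{x}_j)
```

This form is symmetric in i and j and is zero when i = j, which matches the symmetric matrix with a zero diagonal in Figure 25.3. When the priors q_j are unequal, a term -2 \ln q_j is added and the matrix is no longer symmetric.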
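Because the default options METHOD=NORMAL and POOL=YES are in
effect, PROC DISCRIM computes a linear discriminant function for
each species; the coefficients are displayed in Figure 25.4.
An observation is assigned to the species whose function yields
the largest value.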
Linear Discriminant Function for Species

Variable          Bream     Parkki      Perch       Pike      Roach      Smelt  Whitefish
Constant     -185.91682  -64.92517  -48.68009 -148.06402  -62.65963  -19.70401  -67.44603
Weight         -0.10912   -0.09031   -0.09418   -0.13805   -0.09901   -0.05778   -0.09948
Length1       -23.02273  -13.64180  -19.45368  -20.92442  -14.63635   -4.09257  -22.57117
Length2       -26.70692   -5.38195   17.33061    6.19887   -7.47195   -3.63996    3.83450
Length3        50.55780   20.89531    5.25993   22.94989   25.00702   10.60171   21.12638
Height         13.91638    8.44567   -1.42833   -8.99687   -0.26083   -1.84569    0.64957
Width         -23.71895  -13.38592    1.32749   -9.13410   -3.74542   -3.43630   -2.52442

Figure 25.4: Linear Discriminant Function
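
As a rough illustration of how the coefficients in Figure 25.4 are used, the following DATA step sketch scores the first record of the Fish data by hand: for each species it adds the constant to the sum of coefficient times measurement, then keeps the species with the largest score. The data set name score_demo, the arrays coef, spname, and x, and the variables Score, BestScore, and Predicted are hypothetical names introduced here. The coefficients are copied from the figure and are therefore rounded, so treat the result as approximate. PROC DISCRIM performs this classification itself.

```sas
data score_demo;
   /* One row per species, in the column order of Figure 25.4.           */
   /* Columns: Constant Weight Length1 Length2 Length3 Height Width.     */
   array coef{7,7} _temporary_ (
      -185.91682, -0.10912, -23.02273, -26.70692, 50.55780, 13.91638, -23.71895,
       -64.92517, -0.09031, -13.64180,  -5.38195, 20.89531,  8.44567, -13.38592,
       -48.68009, -0.09418, -19.45368,  17.33061,  5.25993, -1.42833,   1.32749,
      -148.06402, -0.13805, -20.92442,   6.19887, 22.94989, -8.99687,  -9.13410,
       -62.65963, -0.09901, -14.63635,  -7.47195, 25.00702, -0.26083,  -3.74542,
       -19.70401, -0.05778,  -4.09257,  -3.63996, 10.60171, -1.84569,  -3.43630,
       -67.44603, -0.09948, -22.57117,   3.83450, 21.12638,  0.64957,  -2.52442);
   array spname{7} $ 9 _temporary_
      ('Bream' 'Parkki' 'Perch' 'Pike' 'Roach' 'Smelt' 'Whitefish');
   array x{6} Weight Length1 Length2 Length3 Height Width;
   length Predicted $ 9;

   /* First record of the Fish data; Height and Width derived as in the DATA step. */
   Weight  = 242.0;
   Length1 = 23.2;
   Length2 = 25.4;
   Length3 = 30.0;
   Height  = 38.4*Length3/100;
   Width   = 13.4*Length3/100;

   /* Evaluate each species' linear function and keep the largest value. */
   do j = 1 to 7;
      Score = coef{j,1};
      do k = 1 to 6;
         Score = Score + coef{j,k+1}*x{k};
      end;
      if j = 1 or Score > BestScore then do;
         BestScore = Score;
         Predicted = spname{j};
      end;
   end;

   put Predicted= BestScore=;   /* writes the result to the SAS log */
   drop j k Score;
run;
```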
A summary of how the discriminant function classifies
the data used to develop the function is displayed
last. In Figure 25.5, you see that only three of the
observations are misclassified: three Perch are classified
as Smelt. The error-count
estimates give the proportion of misclassified observations
in each group; the total is the average of the group rates
weighted by the prior probabilities, here (3/56)/7 = 0.0077.
Since you are classifying the same data that are used
to derive the discriminant function, these error-count
estimates are optimistically biased. One way to reduce the bias of the error-count
estimates is to split the Fish
data into two sets, use one set to derive the discriminant
function, and use the other to run validation tests;
Example 25.4 shows
how to analyze a test data set. Another method
of reducing bias is to classify each observation using a discriminant
function computed from all of the other observations; this
method is invoked with the CROSSVALIDATE option.
The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.FISH
Resubstitution Summary using Linear Discriminant Function

Number of Observations and Percent Classified into Species

From Species   Bream  Parkki   Perch    Pike   Roach   Smelt  Whitefish   Total
Bream             34       0       0       0       0       0          0      34
              100.00    0.00    0.00    0.00    0.00    0.00       0.00  100.00
Parkki             0      11       0       0       0       0          0      11
                0.00  100.00    0.00    0.00    0.00    0.00       0.00  100.00
Perch              0       0      53       0       0       3          0      56
                0.00    0.00   94.64    0.00    0.00    5.36       0.00  100.00
Pike               0       0       0      17       0       0          0      17
                0.00    0.00    0.00  100.00    0.00    0.00       0.00  100.00
Roach              0       0       0       0      20       0          0      20
                0.00    0.00    0.00    0.00  100.00    0.00       0.00  100.00
Smelt              0       0       0       0       0      14          0      14
                0.00    0.00    0.00    0.00    0.00  100.00       0.00  100.00
Whitefish          0       0       0       0       0       0          6       6
                0.00    0.00    0.00    0.00    0.00    0.00     100.00  100.00
Total             34      11      53      17      20      17          6     158
               21.52    6.96   33.54   10.76   12.66   10.76       3.80  100.00
Priors       0.14286 0.14286 0.14286 0.14286 0.14286 0.14286    0.14286

Error Count Estimates for Species

               Bream  Parkki   Perch    Pike   Roach   Smelt  Whitefish   Total
Rate          0.0000  0.0000  0.0536  0.0000  0.0000  0.0000     0.0000  0.0077
Priors        0.1429  0.1429  0.1429  0.1429  0.1429  0.1429     0.1429

Figure 25.5: Resubstitution Misclassification Summary
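
The two bias-reduction approaches mentioned above can be sketched as follows. This is an illustrative outline, not the program from Example 25.4: the data set names train and test, the 70/30 split, and the seed 12345 are arbitrary assumptions made here.

```sas
/* Cross validation: each fish is classified by a discriminant function */
/* computed from all of the other observations.                         */
proc discrim data=fish crossvalidate;
   class Species;
run;

/* Holdout alternative: split the data at random (roughly 70/30),       */
/* derive the function from one part, and classify the other part.      */
data train test;
   set fish;
   if ranuni(12345) < 0.7 then output train;
   else output test;
run;

proc discrim data=train testdata=test;
   class Species;
   testclass Species;   /* true species in the test data, used for error rates */
run;
```

With only six Whitefish in the data, a purely random split can leave very few of them in either part, so cross validation may be the more practical choice for a data set of this size.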
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.