Example 23.5: Computing a Distance Matrix
A wide variety of distance and similarity measures are used
in cluster analysis (Anderberg 1973, Sneath and Sokal 1973).
If your data are in coordinate form and you want to use
a non-Euclidean distance for clustering, you can compute
a distance matrix using a DATA step or the IML procedure.
Similarity measures must be converted to
dissimilarities before being used in PROC CLUSTER.
Such conversion can be done in a variety of ways, such
as taking reciprocals or subtracting from a large value.
The choice of conversion method depends on
the application and the similarity measure.
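As a rough illustration of the two conversions mentioned above, the following DATA step sketch applies each of them to a few similarity values. The data set name DISSIM, the variable names S, D_SUBTRACT, and D_RECIP, and the sample values are all hypothetical, and the similarity is assumed to lie between 0 and 1.

```sas
/* Illustrative sketch only: S is a hypothetical similarity in [0, 1] */
data dissim;
   input s @@;
   d_subtract = 1 - s;              /* subtract from the largest possible value */
   if s > 0 then d_recip = 1 / s;   /* reciprocal; left missing when s = 0      */
   else d_recip = .;
   datalines;
1 0.75 0.5 0
;

proc print data=dissim;
run;
```

Subtracting from the maximum keeps the result bounded, whereas the reciprocal grows without limit as the similarity approaches 0, which is one reason the choice depends on the measure.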
In the following example, the observations are states.
Each binary-valued variable corresponds to one
ground for divorce and indicates whether that
ground applies in a given state.
The %DISTANCE
macro is used to compute the Jaccard coefficient
(Anderberg 1973, pp. 89, 115, and 117) between each pair of states.
The Jaccard coefficient is defined as the number of variables
that are coded as 1 for both states divided by the number of
variables that are coded as 1 for either or both states.
The Jaccard coefficient is converted to a
distance measure by subtracting it from 1.
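The definition can be checked by hand for a single pair of states. The following DATA step is a minimal sketch, not the method the macro uses internally: it hard-codes the nine indicators for ALABAMA and ALASKA from the example data, counts the variables coded 1 for both states and for either state, and subtracts the ratio from 1. The variable names BOTH, EITHER, and DJACCARD are chosen for illustration, and returning a missing value when neither state has any variable coded 1 is an assumption.

```sas
/* Illustrative check of the Jaccard distance for one pair of states */
data _null_;
   array a[9] _temporary_ (1 1 1 1 1 1 1 1 1);   /* ALABAMA */
   array b[9] _temporary_ (1 1 1 0 1 1 1 1 0);   /* ALASKA  */
   both   = 0;
   either = 0;
   do i = 1 to 9;
      both   + (a[i] = 1 and b[i] = 1);   /* coded 1 for both states      */
      either + (a[i] = 1 or  b[i] = 1);   /* coded 1 for either or both   */
   end;
   /* Assumption: the distance is left missing when no variable is coded 1 */
   if either > 0 then djaccard = 1 - both / either;
   else djaccard = .;
   put both= either= djaccard= 7.5;
run;
```

Here seven variables are coded 1 for both states and nine for at least one of them, so the distance is 1 - 7/9, which agrees with the value 0.22222 shown for this pair in Output 23.5.1.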
%include '<location of SAS/STAT sample library>/xmacro.sas';
%include '<location of SAS/STAT sample library>/distnew.sas';
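options ls=120 ps=60;

title 'Grounds for Divorce';

/* Each data line holds two states. A record is a 15-column  */
/* state name followed by nine 1-column 0/1 indicators.      */
data divorce;
   input state $15.
         (incompat cruelty desertn non_supp alcohol
          felony impotenc insanity separate) (1.) @@;
   /* After an odd-numbered record, skip 4 columns and hold  */
   /* the line; after an even-numbered one, release the line. */
   if mod(_n_,2) then input +4 @@; else input;
   datalines;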
ALABAMA        111111111    ALASKA         111011110
ARIZONA        100000000    ARKANSAS       011111111
CALIFORNIA     100000010    COLORADO       100000000
CONNECTICUT    111111011    DELAWARE       100000001
FLORIDA        100000010    GEORGIA        111011110
HAWAII         100000001    IDAHO          111111011
ILLINOIS       011011100    INDIANA        100001110
IOWA           100000000    KANSAS         111011110
KENTUCKY       100000000    LOUISIANA      000001001
MAINE          111110110    MARYLAND       011001111
MASSACHUSETTS  111111101    MICHIGAN       100000000
MINNESOTA      100000000    MISSISSIPPI    111011110
MISSOURI       100000000    MONTANA        100000000
NEBRASKA       100000000    NEVADA         100000011
NEW HAMPSHIRE  111111100    NEW JERSEY     011011011
NEW MEXICO     111000000    NEW YORK       011001001
NORTH CAROLINA 000000111    NORTH DAKOTA   111111110
OHIO           111011101    OKLAHOMA       111111110
OREGON         100000000    PENNSYLVANIA   011001110
RHODE ISLAND   111111101    SOUTH CAROLINA 011010001
SOUTH DAKOTA   011111000    TENNESSEE      111111100
TEXAS          111001011    UTAH           011111110
VERMONT        011101011    VIRGINIA       010001001
WASHINGTON     100000001    WEST VIRGINIA  111011011
WISCONSIN      100000001    WYOMING        100000011
;
/* Compute the Jaccard distance (1 - Jaccard coefficient) for all pairs */
%distance(data=divorce, id=state, options=nomiss, out=distjacc,
          shape=square, method=djaccard, var=incompat--separate);

proc print data=distjacc(obs=10);
   id state; var alabama--georgia;
   title2 'First 10 states';
run;
title2;

/* Centroid clustering of the square distance matrix */
proc cluster data=distjacc method=centroid
             pseudo outtree=tree;
   id state;
   var alabama--wyoming;
run;

/* Cut the tree at nine clusters */
proc tree data=tree noprint n=9 out=out;
   id state;
run;

proc sort data=out;
   by state;
run;

/* Attach cluster membership to the original indicators */
data clus;
   merge divorce out;
   by state;
run;

proc sort data=clus;
   by cluster;
run;

proc print data=clus;
   id state;
   var incompat--separate;
   by cluster;
run;
Output 23.5.1: Computing a Distance Matrix

                                             Grounds for Divorce
                                               First 10 states

state          ALABAMA   ALASKA  ARIZONA  ARKANSAS  CALIFORNIA  COLORADO  CONNECTICUT  DELAWARE  FLORIDA  GEORGIA
ALABAMA        0.00000  0.22222  0.88889   0.11111     0.77778   0.88889      0.11111   0.77778  0.77778  0.22222
ALASKA         0.22222  0.00000  0.85714   0.33333     0.71429   0.85714      0.33333   0.87500  0.71429  0.00000
ARIZONA        0.88889  0.85714  0.00000   1.00000     0.50000   0.00000      0.87500   0.50000  0.50000  0.85714
ARKANSAS       0.11111  0.33333  1.00000   0.00000     0.88889   1.00000      0.22222   0.88889  0.88889  0.33333
CALIFORNIA     0.77778  0.71429  0.50000   0.88889     0.00000   0.50000      0.75000   0.66667  0.00000  0.71429
COLORADO       0.88889  0.85714  0.00000   1.00000     0.50000   0.00000      0.87500   0.50000  0.50000  0.85714
CONNECTICUT    0.11111  0.33333  0.87500   0.22222     0.75000   0.87500      0.00000   0.75000  0.75000  0.33333
DELAWARE       0.77778  0.87500  0.50000   0.88889     0.66667   0.50000      0.75000   0.00000  0.66667  0.87500
FLORIDA        0.77778  0.71429  0.50000   0.88889     0.00000   0.50000      0.75000   0.66667  0.00000  0.71429
GEORGIA        0.22222  0.00000  0.85714   0.33333     0.71429   0.85714      0.33333   0.87500  0.71429  0.00000


                          The CLUSTER Procedure
                 Centroid Hierarchical Cluster Analysis

        Root-Mean-Square Distance Between Observations = 0.694873

                             Cluster History

NCL  Clusters Joined                 FREQ    PSF   PST2  Norm Cent Dist  Tie
 49  ARIZONA         COLORADO           2      .      .               0  T
 48  CALIFORNIA      FLORIDA            2      .      .               0  T
 47  ALASKA          GEORGIA            2      .      .               0  T
 46  DELAWARE        HAWAII             2      .      .               0  T
 45  CONNECTICUT     IDAHO              2      .      .               0  T
 44  CL49            IOWA               3      .      .               0  T
 43  CL47            KANSAS             3      .      .               0  T
 42  CL44            KENTUCKY           4      .      .               0  T
 41  CL42            MICHIGAN           5      .      .               0  T
 40  CL41            MINNESOTA          6      .      .               0  T
 39  CL43            MISSISSIPPI        4      .      .               0  T
 38  CL40            MISSOURI           7      .      .               0  T
 37  CL38            MONTANA            8      .      .               0  T
 36  CL37            NEBRASKA           9      .      .               0  T
 35  NORTH DAKOTA    OKLAHOMA           2      .      .               0  T
 34  CL36            OREGON            10      .      .               0  T
 33  MASSACHUSETTS   RHODE ISLAND       2      .      .               0  T
 32  NEW HAMPSHIRE   TENNESSEE          2      .      .               0  T
 31  CL46            WASHINGTON         3      .      .               0  T
 30  CL31            WISCONSIN          4      .      .               0  T
 29  NEVADA          WYOMING            2      .      .               0
 28  ALABAMA         ARKANSAS           2   1561      .          0.1599  T
 27  CL33            CL32               4    479      .          0.1799  T
 26  CL39            CL35               6    265      .          0.1799  T
 25  CL45            WEST VIRGINIA      3    231      .          0.1799
 24  MARYLAND        PENNSYLVANIA       2    199      .          0.2399
 23  CL28            UTAH               3    167    3.2          0.2468
 22  CL27            OHIO               5    136    5.4          0.2698
 21  CL26            MAINE              7    111    8.9          0.2998
 20  CL23            CL21              10   75.2    8.7          0.3004
 19  CL25            NEW JERSEY         4   71.8    6.5          0.3053  T
 18  CL19            TEXAS              5   69.1    2.5          0.3077
 17  CL20            CL22              15   48.7    9.9          0.3219
 16  NEW YORK        VIRGINIA           2   50.1      .          0.3598
 15  CL18            VERMONT            6   49.4    2.9          0.3797
 14  CL17            ILLINOIS          16   47.0    3.2          0.4425
 13  CL14            CL15              22   29.2   15.3          0.4722
 12  CL48            CL29               4   29.5      .          0.4797  T
 11  CL13            CL24              24   27.6    4.5          0.5042
 10  CL11            SOUTH DAKOTA      25   28.4    2.4          0.5449
  9  LOUISIANA       CL16               3   30.3    3.5          0.5844
  8  CL34            CL30              14   23.3      .          0.7196
  7  CL8             CL12              18   19.3   15.0          0.7175
  6  CL10            SOUTH CAROLINA    26   21.4    4.2          0.7384
  5  CL6             NEW MEXICO        27   24.0    4.7          0.8303
  4  CL5             INDIANA           28   28.9    4.1          0.8343
  3  CL4             CL9               31   31.7   10.9          0.8472
  2  CL3             NORTH CAROLINA    32   55.1    4.1          1.0017
  1  CL2             CL7               50      .   55.1          1.0663


CLUSTER=1

state            incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate
ARIZONA                 1         0         0         0         0         0         0         0         0
COLORADO                1         0         0         0         0         0         0         0         0
IOWA                    1         0         0         0         0         0         0         0         0
KENTUCKY                1         0         0         0         0         0         0         0         0
MICHIGAN                1         0         0         0         0         0         0         0         0
MINNESOTA               1         0         0         0         0         0         0         0         0
MISSOURI                1         0         0         0         0         0         0         0         0
MONTANA                 1         0         0         0         0         0         0         0         0
NEBRASKA                1         0         0         0         0         0         0         0         0
OREGON                  1         0         0         0         0         0         0         0         0

CLUSTER=2

state            incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate
CALIFORNIA              1         0         0         0         0         0         0         1         0
FLORIDA                 1         0         0         0         0         0         0         1         0
NEVADA                  1         0         0         0         0         0         0         1         1
WYOMING                 1         0         0         0         0         0         0         1         1

CLUSTER=3

state            incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate
ALABAMA                 1         1         1         1         1         1         1         1         1
ALASKA                  1         1         1         0         1         1         1         1         0
ARKANSAS                0         1         1         1         1         1         1         1         1
CONNECTICUT             1         1         1         1         1         1         0         1         1
GEORGIA                 1         1         1         0         1         1         1         1         0
IDAHO                   1         1         1         1         1         1         0         1         1
ILLINOIS                0         1         1         0         1         1         1         0         0
KANSAS                  1         1         1         0         1         1         1         1         0
MAINE                   1         1         1         1         1         0         1         1         0
MARYLAND                0         1         1         0         0         1         1         1         1
MASSACHUSETTS           1         1         1         1         1         1         1         0         1
MISSISSIPPI             1         1         1         0         1         1         1         1         0
NEW HAMPSHIRE           1         1         1         1         1         1         1         0         0
NEW JERSEY              0         1         1         0         1         1         0         1         1
NORTH DAKOTA            1         1         1         1         1         1         1         1         0
OHIO                    1         1         1         0         1         1         1         0         1
OKLAHOMA                1         1         1         1         1         1         1         1         0
PENNSYLVANIA            0         1         1         0         0         1         1         1         0
RHODE ISLAND            1         1         1         1         1         1         1         0         1
SOUTH DAKOTA            0         1         1         1         1         1         0         0         0
TENNESSEE               1         1         1         1         1         1         1         0         0
TEXAS                   1         1         1         0         0         1         0         1         1
UTAH                    0         1         1         1         1         1         1         1         0
VERMONT                 0         1         1         1         0         1         0         1         1
WEST VIRGINIA           1         1         1         0         1         1         0         1         1

CLUSTER=4

state            incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate
DELAWARE                1         0         0         0         0         0         0         0         1
HAWAII                  1         0         0         0         0         0         0         0         1
WASHINGTON              1         0         0         0         0         0         0         0         1
WISCONSIN               1         0         0         0         0         0         0         0         1

CLUSTER=5

state            incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate
LOUISIANA               0         0         0         0         0         1         0         0         1
NEW YORK                0         1         1         0         0         1         0         0         1
VIRGINIA                0         1         0         0         0         1         0         0         1

CLUSTER=6

state            incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate
SOUTH CAROLINA          0         1         1         0         1         0         0         0         1

CLUSTER=7

state            incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate
NEW MEXICO              1         1         1         0         0         0         0         0         0

CLUSTER=8

state            incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate
INDIANA                 1         0         0         0         0         1         1         1         0

CLUSTER=9

state            incompat   cruelty   desertn  non_supp   alcohol    felony  impotenc  insanity  separate
NORTH CAROLINA          0         0         0         0         0         0         1         1         1

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.