The CORRESP Procedure
In this example, PROC CORRESP reads an existing contingency table with supplementary observations and performs a simple correspondence analysis. The data are populations of the fifty states, grouped into regions, for each of the census years from 1920 to 1970 (U.S. Bureau of the Census 1979). Alaska and Hawaii are treated as supplementary regions. Neither was a state during this entire period, and neither is physically connected to the other 48 states. Consequently, it is reasonable to expect that population changes in these two states operate differently from population changes in the other states. The correspondence analysis is performed with the supplementary points given negative weights; the coordinates for the supplementary points are then computed in the solution defined by the other points.
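The final step, placing a supplementary row in an existing solution, can be sketched with the standard transition formula of correspondence analysis. The symbols below are notation assumed for this sketch and do not come from the PROC CORRESP documentation: n_{sj} is the count for supplementary row s in column j, g_{jk} is the principal coordinate of column j on dimension k in the solution fitted to the active rows, and \lambda_k is the principal inertia of dimension k.

```latex
r_{sj} = \frac{n_{sj}}{\sum_{j'} n_{sj'}}, \qquad
f_{sk} = \frac{1}{\sqrt{\lambda_k}} \sum_{j} r_{sj}\, g_{jk}
```

Here r_{sj} is the row profile of the supplementary row and f_{sk} is its coordinate on dimension k. Because g_{jk} and \lambda_k are computed from the active rows only, the supplementary row is positioned by the solution without contributing to it.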
The initial DATA step reads the table, provides labels for the years, flags the supplementary rows with negative weights, and specifies absolute weights of 1000 for all observations since the data were originally reported in units of 1000 people.
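The double role of the weight can be illustrated outside SAS. The following is a minimal Python sketch, not PROC CORRESP's implementation: the helper name split_by_weight is hypothetical, and only two census years of two regions are used to keep it short.

```python
# Sketch: the sign of the weight flags a supplementary row, and the absolute
# value rescales the counts. split_by_weight is a hypothetical helper name.

def split_by_weight(rows):
    """rows: iterable of (label, counts, weight).

    Returns two dicts, (active, supplementary), with counts scaled by |weight|.
    """
    active, supplementary = {}, {}
    for label, counts, weight in rows:
        scaled = [abs(weight) * c for c in counts]
        target = supplementary if weight < 0 else active
        target[label] = scaled
    return active, supplementary


rows = [
    ("Pacific", [5567, 25454], 1000),   # 1920 and 1970, in thousands
    ("Alaska", [55, 300], -1000),       # negative weight: supplementary
]
active, supplementary = split_by_weight(rows)
print(active)          # {'Pacific': [5567000, 25454000]}
print(supplementary)   # {'Alaska': [55000, 300000]}
```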
In the PROC CORRESP statement, PRINT=PERCENT together with the display options requests four tables in percentage form: the cell percentages (OBSERVED), the cell contributions to the total chi-square, scaled to sum to 100 (CELLCHI2), the row profiles, in which each row sums to 100 (RP), and the column profiles, in which each column sums to 100 (CP). The SHORT option suppresses the correspondence analysis summary statistics, contributions to inertia, and squared cosines. The OUTC=Coor option creates the output coordinate data set. Since the data are already in table form, a VAR statement is used to read the table. Row labels are specified with the ID statement, and column labels come from the variable labels. The WEIGHT statement flags the supplementary observations and restores the table values to populations.
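To make the four percentage tables concrete, here is a small Python sketch of the arithmetic behind them. It is an illustration only: it uses a toy subset of the data (two regions, three census years), so its numbers will not match the procedure's output for the full table, and the variable names are chosen for this sketch.

```python
# Toy subset of the table, in thousands: columns are 1920, 1950, 1970.
table = {
    "Midwest": [12544, 14061, 16319],
    "Pacific": [5567, 14486, 25454],
}
counts = list(table.values())
n = sum(sum(row) for row in counts)
row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]

# OBSERVED with PRINT=PERCENT: every cell as a percentage of the grand total.
observed_pct = [[100 * x / n for x in row] for row in counts]

# RP: each row divided by its own total, so each row sums to 100.
row_profiles = [[100 * x / rt for x in row] for row, rt in zip(counts, row_tot)]

# CP: each column divided by its own total, so each column sums to 100.
col_profiles = [[100 * x / ct for x, ct in zip(row, col_tot)] for row in counts]

# CELLCHI2: (O - E)^2 / E per cell, with E = row total * column total / n,
# then rescaled so that all cells together sum to 100.
cell = [
    [(x - rt * ct / n) ** 2 / (rt * ct / n) for x, ct in zip(row, col_tot)]
    for row, rt in zip(counts, row_tot)
]
chi2 = sum(sum(row) for row in cell)
cell_pct = [[100 * v / chi2 for v in row] for row in cell]

for label, row in zip(table, row_profiles):
    print(label, [round(v, 1) for v in row])
```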
The %PLOTIT macro is used to plot the results. Normally, you only need to tell the %PLOTIT macro the name of the input data set, DATA=Coor, and the type of analysis performed on the data, DATATYPE=CORRESP. In this case, PLOTVARS=Dim1 Dim2 is also specified to indicate that Dim1 is the vertical axis variable, as opposed to the default PLOTVARS=Dim2 Dim1.
For an essentially one-dimensional plot such as this, specifying PLOTVARS=Dim1 Dim2 improves the graphical display.
The following statements produce Output 24.3.1 and Output 24.3.2:
title 'United States Population';

data USPop;

   * Regions:
   * New England    - ME, NH, VT, MA, RI, CT.
   * Great Lake     - OH, IN, IL, MI, WI.
   * South Atlantic - DE, MD, DC, VA, WV, NC, SC, GA, FL.
   * Mountain       - MT, ID, WY, CO, NM, AZ, UT, NV.
   * Pacific        - WA, OR, CA.
   *
   * Note: Multiply data values by 1000 to get populations.;

   input Region $14. y1920 y1930 y1940 y1950 y1960 y1970;

   label y1920 = '1920'    y1930 = '1930'    y1940 = '1940'
         y1950 = '1950'    y1960 = '1960'    y1970 = '1970';

   if region = 'Hawaii' or region = 'Alaska'
      then w = -1000;   /* Flag Supplementary Observations */
      else w =  1000;

   datalines;
New England      7401  8166  8437  9314 10509 11842
NY, NJ, PA      22261 26261 27539 30146 34168 37199
Great Lake      21476 25297 26626 30399 36225 40252
Midwest         12544 13297 13517 14061 15394 16319
South Atlantic  13990 15794 17823 21182 25972 30671
KY, TN, AL, MS   8893  9887 10778 11447 12050 12803
AR, LA, OK, TX  10242 12177 13065 14538 16951 19321
Mountain         3336  3702  4150  5075  6855  8282
Pacific          5567  8195  9733 14486 20339 25454
Alaska             55    59    73   129   226   300
Hawaii            256   368   423   500   633   769
;

*---Perform Simple Correspondence Analysis---;
proc corresp print=percent observed cellchi2 rp cp
             short outc=Coor;
   var y1920 -- y1970;
   id Region;
   weight w;
run;

*---Plot the Simple Correspondence Analysis Results---;
%plotit(data=Coor, datatype=corresp, plotvars=Dim1 Dim2)
The contingency table shows that the population of all regions increased over this time period. The row profiles show that population increased at different rates in different regions. There is a small increase in population in the Midwest, for example, but the population more than quadrupled in the Pacific region over the same period. The column profiles show that in 1920, the U.S. population was concentrated in the NY, NJ, PA, Great Lakes, Midwest, and South Atlantic regions. Over time, the population shifted toward the South Atlantic, Mountain, and Pacific regions. This is also clear from the correspondence analysis. The inertia and chi-square decomposition table shows that there are five nontrivial dimensions in the table, but the association between the rows and columns is almost entirely one-dimensional.
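The growth comparison can be checked directly from the 1920 and 1970 columns of the input table. A short Python sketch of that arithmetic, with dictionary names chosen for illustration:

```python
# Populations in thousands, copied from the 1920 and 1970 columns.
pop_1920 = {"Midwest": 12544, "Pacific": 5567}
pop_1970 = {"Midwest": 16319, "Pacific": 25454}

# Ratio of the 1970 population to the 1920 population for each region.
growth = {region: pop_1970[region] / pop_1920[region] for region in pop_1920}

for region, ratio in growth.items():
    print(f"{region}: {ratio:.2f}x")
# Midwest: 1.30x
# Pacific: 4.57x
```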
Output 24.3.1: Simple Correspondence Analysis of a Contingency Table with Supplementary Observations
The plot also shows that the growth pattern for Hawaii is similar to the growth pattern for the Mountain states and that Alaska's growth is even more extreme than the Pacific states' growth. The row profiles confirm this interpretation.
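The row profiles for the two supplementary regions and their nearest active counterparts can be recomputed from the input table. The following Python sketch shows only that arithmetic; it does not reproduce the procedure's output formatting.

```python
# Counts in thousands for 1920, 1930, 1940, 1950, 1960, 1970.
rows = {
    "Mountain": [3336, 3702, 4150, 5075, 6855, 8282],
    "Hawaii":   [256, 368, 423, 500, 633, 769],
    "Pacific":  [5567, 8195, 9733, 14486, 20339, 25454],
    "Alaska":   [55, 59, 73, 129, 226, 300],
}

# Row profile: each count as a percentage of its own row total.
profiles = {
    region: [100 * x / sum(counts) for x in counts]
    for region, counts in rows.items()
}

for region, profile in profiles.items():
    print(f"{region:9s}", " ".join(f"{p:5.1f}" for p in profile))
```

In these profiles, Hawaii's percentages track the Mountain region's within a few points in every year, while Alaska places a larger share of its total in 1960 and 1970 than the Pacific region does.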
The Pacific region is farther from the origin than all other active points. The Midwest is the extreme region in the other direction. The table of contributions to the total chi-square shows that 62% of the total chi-square statistic is contributed by the Pacific region, followed by the Midwest at over 14%. Similarly, the two extreme years, 1920 and 1970, together contribute over 63% to the total chi-square, whereas the years nearer the origin of the plot contribute less.
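These shares can be recovered from the nine active rows alone, since supplementary rows do not enter the chi-square statistic. The Python sketch below is an independent recomputation under that assumption, not the procedure's own code. It works in units of 1000 people, which leaves the shares unchanged.

```python
years = [1920, 1930, 1940, 1950, 1960, 1970]
active = {  # counts in thousands; Alaska and Hawaii are omitted
    "New England":    [7401, 8166, 8437, 9314, 10509, 11842],
    "NY, NJ, PA":     [22261, 26261, 27539, 30146, 34168, 37199],
    "Great Lake":     [21476, 25297, 26626, 30399, 36225, 40252],
    "Midwest":        [12544, 13297, 13517, 14061, 15394, 16319],
    "South Atlantic": [13990, 15794, 17823, 21182, 25972, 30671],
    "KY, TN, AL, MS": [8893, 9887, 10778, 11447, 12050, 12803],
    "AR, LA, OK, TX": [10242, 12177, 13065, 14538, 16951, 19321],
    "Mountain":       [3336, 3702, 4150, 5075, 6855, 8282],
    "Pacific":        [5567, 8195, 9733, 14486, 20339, 25454],
}

n = sum(sum(counts) for counts in active.values())
col_tot = [sum(col) for col in zip(*active.values())]


def cell_chi2(counts):
    """(O - E)^2 / E for one row, with E = row total * column total / n."""
    row_total = sum(counts)
    return [
        (x - row_total * ct / n) ** 2 / (row_total * ct / n)
        for x, ct in zip(counts, col_tot)
    ]


cells = {region: cell_chi2(counts) for region, counts in active.items()}
chi2 = sum(sum(row) for row in cells.values())

# Share of the total chi-square contributed by each region and by each year.
row_share = {region: sum(row) / chi2 for region, row in cells.items()}
col_share = {y: sum(col) / chi2 for y, col in zip(years, zip(*cells.values()))}

for region, share in sorted(row_share.items(), key=lambda kv: -kv[1]):
    print(f"{region:15s} {100 * share:5.1f}%")
print(f"1920 + 1970: {100 * (col_share[1920] + col_share[1970]):.1f}%")
```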
Output 24.3.2: Plot of Simple Correspondence Analysis of a Contingency Table with Supplementary Observations
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.