
STAT 350: Lecture 27

Variable Selection Methods

PROBLEM: Find a set of predictor variables that gives a good fit, predicts the dependent variable well, and is as small as possible.

Thus far we have used F and t tests to compare two models at a time. We have followed a sequence of such tests to try to find a good set of variables, but our method has been informal: other statisticians using the same approach might select a different final model. Here we investigate four (more or less) mechanical variable selection methods: Forward, Backward, Stepwise and All Subsets.

FORWARD

BACKWARD

STEPWISE

ALL SUBSETS
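To get a feel for how much work All Subsets involves, here is a minimal Python sketch (not part of the SAS analysis in these notes) that enumerates every nonempty subset of the eight candidate predictors used in the example that follows. The helper name all_subsets is made up for illustration.

```python
from itertools import combinations

# Candidate predictors, copied from the MODEL statements in the example.
PREDICTORS = ["Culture", "Stay", "Nurses", "Nratio",
              "Chest", "Beds", "Census", "Facil"]


def all_subsets(names):
    """Yield every nonempty subset of names, smallest subsets first."""
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            yield subset


subsets = list(all_subsets(PREDICTORS))
print(len(subsets))  # one regression would be fitted per subset
```

With k candidate predictors there are 2^k - 1 nonempty subsets, so the count doubles with each added predictor. The other three methods visit only a small part of this list.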

Example

FORWARD Selection

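data scenic;
 /* firstobs=2: start reading at the second line of scenic.dat */
 infile 'scenic.dat' firstobs=2;
 input Stay  Age Risk Culture Chest Beds 
    School Region Census Nurses Facil;
 /* derived predictor: ratio of Nurses to Census */
 Nratio = Nurses / Census  ;
run;
proc reg  data=scenic;
  /* forward selection; no sle= is given, so SAS uses the
     0.50 entry level reported at the end of the output */
  model Risk = Culture Stay Nurses Nratio 
     Chest Beds Census Facil / 
     selection=forward;
run ;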
EDITED SAS OUTPUT (Complete output)
          Forward Selection Procedure for Dependent Variable RISK    
Step 1   Variable CULTURE Entered   R-square = 0.31265864   C(p) = 47.47794976
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       1            62.96314170      62.96314170      50.49   0.0001
 Error          111           138.41668131       1.24699713
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP       3.19789965      0.19376813     339.64905575     272.37   0.0001
 CULTURE        0.07325862      0.01030975      62.96314170      50.49   0.0001
--------------------------------------------------------------------------------
Step 2   Variable STAY Entered      R-square = 0.45040256   C(p) = 18.11960703
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       2            90.70198757      45.35099379      45.07   0.0001
 Error          110           110.67783543       1.00616214
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP       0.80549102      0.48775579       2.74400250       2.73   0.1015
 CULTURE        0.05645147      0.00979843      33.39687778      33.19   0.0001
 STAY           0.27547211      0.05246473      27.73884588      27.57   0.0001
--------------------------------------------------------------------------------
Step 3   Variable FACIL Entered     R-square = 0.49340010   C(p) = 10.33092385
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       3            99.36082444      33.12027481      35.39   0.0001
 Error          109           102.01899857       0.93595412
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP       0.49133226      0.48163614       0.97401801       1.04   0.3099
 CULTURE        0.05419997      0.00947933      30.59827862      32.69   0.0001
 STAY           0.22390748      0.05336561      16.47664606      17.60   0.0001
 FACIL          0.01963027      0.00645392       8.65883687       9.25   0.0029
--------------------------------------------------------------------------------
Step 4   Variable NRATIO Entered    R-square = 0.52547952   C(p) =  5.02782551
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       4           105.82097194      26.45524298      29.90   0.0001
 Error          108            95.55885107       0.88480418
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.49505513      0.59376426       0.61507231       0.70   0.4063
 CULTURE        0.04818092      0.00948204      22.84513509      25.82   0.0001
 STAY           0.26758404      0.05434637      21.44995791      24.24   0.0001
 NRATIO         0.79262357      0.29333869       6.46014750       7.30   0.0080
 FACIL          0.01747585      0.00632554       6.75349077       7.63   0.0067
--------------------------------------------------------------------------------
Step 5   Variable CHEST Entered     R-square = 0.53792463   C(p) =  4.19461013
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       5           108.32716704      21.66543341      24.91   0.0001
 Error          107            93.05265597       0.86965099
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.76804342      0.61022741       1.37763165       1.58   0.2109
 CULTURE        0.04318856      0.00984976      16.71979631      19.23   0.0001
 STAY           0.23392650      0.05741114      14.43814950      16.60   0.0001
 NRATIO         0.67240318      0.29931440       4.38883521       5.05   0.0267
 CHEST          0.00917860      0.00540681       2.50619510       2.88   0.0925
 FACIL          0.01843860      0.00629673       7.45710068       8.57   0.0042
--------------------------------------------------------------------------------
Step 6   Variable CENSUS Entered    R-square = 0.54146833   C(p) =  5.38786192
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       6           109.04079737      18.17346623      20.86   0.0001
 Error          106            92.33902564       0.87112288
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.60982957      0.63526660       0.80275743       0.92   0.3393
 CULTURE        0.04327612      0.00985857      16.78604192      19.27   0.0001
 STAY           0.21806849      0.06007156      11.47961755      13.18   0.0004
 NRATIO         0.74253899      0.30942748       5.01649265       5.76   0.0182
 CHEST          0.00967176      0.00543875       2.75481205       3.16   0.0782
 CENSUS         0.00092337      0.00102018       0.71363033       0.82   0.3675
 FACIL          0.01171496      0.00974167       1.25977812       1.45   0.2318
--------------------------------------------------------------------------------
No other variable met the 0.5000 significance level for entry into the model.
     Summary of Forward Selection Procedure for Dependent Variable RISK    
         Variable   Number   Partial     Model
 Step    Entered        In      R**2      R**2        C(p)           F   Prob>F
    1    CULTURE         1    0.3127    0.3127     47.4779     50.4918   0.0001
    2    STAY            2    0.1377    0.4504     18.1196     27.5690   0.0001
    3    FACIL           3    0.0430    0.4934     10.3309      9.2513   0.0029
    4    NRATIO          4    0.0321    0.5255      5.0278      7.3012   0.0080
    5    CHEST           5    0.0124    0.5379      4.1946      2.8818   0.0925
    6    CENSUS          6    0.0035    0.5415      5.3879      0.8192   0.3675
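The F and Partial R**2 columns of the summary can be checked by hand from the error sums of squares printed at each step. The following Python sketch does so for Step 2, where STAY joins CULTURE. The numbers are copied from the output above, and the function names are only illustrative.

```python
# Values copied from Steps 1 and 2 of the forward selection output.
SSE_STEP1 = 138.41668131   # error SS with CULTURE only
SSE_STEP2 = 110.67783543   # error SS with CULTURE and STAY
MSE_STEP2 = 1.00616214     # error mean square at Step 2
SS_TOTAL = 201.37982301    # total SS (same at every step)


def f_to_enter(sse_before, sse_after, mse_after):
    """Partial F statistic (1 numerator df) for adding one variable."""
    return (sse_before - sse_after) / mse_after


def partial_r2(sse_before, sse_after, ss_total):
    """Increase in R-square produced by the added variable."""
    return (sse_before - sse_after) / ss_total


f_stay = f_to_enter(SSE_STEP1, SSE_STEP2, MSE_STEP2)
r2_stay = partial_r2(SSE_STEP1, SSE_STEP2, SS_TOTAL)
print(f_stay, r2_stay)  # compare with 27.5690 and 0.1377 in the summary
```

The drop in error SS, 138.41668131 - 110.67783543 = 27.73884588, is exactly the Type II Sum of Squares listed for STAY at Step 2.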

BACKWARD Selection

data scenic;
 infile 'scenic.dat' firstobs=2;
 input Stay  Age Risk Culture Chest Beds 
    School Region Census Nurses Facil;
 Nratio = Nurses / Census  ;
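proc reg  data=scenic;
  /* backward elimination; no sls= is given, so SAS uses the
     0.10 stay level reported at the end of the output */
  model Risk = Culture Stay Nurses Nratio 
     Chest Beds Census Facil / 
     selection=backward;
run ;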
EDITED SAS OUTPUT (Complete output)
         Backward Elimination Procedure for Dependent Variable RISK    
Step 0    All Variables Entered     R-square = 0.54317205   C(p) =  9.00000000
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       8           109.38389081      13.67298635      15.46   0.0001
 Error          104            91.99593220       0.88457627
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.61544324      0.66644759       0.75436112       0.85   0.3579
 CULTURE        0.04410334      0.01004539      17.05081019      19.28   0.0001
 STAY           0.20541554      0.06405141       9.09797087      10.29   0.0018
 NURSES        -0.00087592      0.00216138       0.14527948       0.16   0.6861
 NRATIO         0.85012850      0.39334446       4.13198139       4.67   0.0330
 CHEST          0.00946882      0.00549665       2.62501031       2.97   0.0879
 BEDS          -0.00106503      0.00265225       0.14263677       0.16   0.6888
 CENSUS         0.00295314      0.00357645       0.60311345       0.68   0.4109
 FACIL          0.01312502      0.01010817       1.49138778       1.69   0.1970
--------------------------------------------------------------------------------
Step 1   Variable BEDS Removed      R-square = 0.54246375   C(p) =  7.16124870
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       7           109.24125404      15.60589343      17.78   0.0001
 Error          105            92.13856897       0.87751018
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.66993458      0.64987468       0.93251908       1.06   0.3050
 CULTURE        0.04396605      0.00999939      16.96447009      19.33   0.0001
 STAY           0.21222554      0.06151830      10.44332482      11.90   0.0008
 NURSES        -0.00101551      0.00212470       0.20045667       0.23   0.6337
 NRATIO         0.85643956      0.39145742       4.20026358       4.79   0.0309
 CHEST          0.00947189      0.00547465       2.62672060       2.99   0.0865
 CENSUS         0.00178564      0.00207440       0.65020801       0.74   0.3913
 FACIL          0.01228514      0.00984983       1.36506916       1.56   0.2151
--------------------------------------------------------------------------------
Step 2   Variable NURSES Removed    R-square = 0.54146833   C(p) =  5.38786192
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       6           109.04079737      18.17346623      20.86   0.0001
 Error          106            92.33902564       0.87112288
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.60982957      0.63526660       0.80275743       0.92   0.3393
 CULTURE        0.04327612      0.00985857      16.78604192      19.27   0.0001
 STAY           0.21806849      0.06007156      11.47961755      13.18   0.0004
 NRATIO         0.74253899      0.30942748       5.01649265       5.76   0.0182
 CHEST          0.00967176      0.00543875       2.75481205       3.16   0.0782
 CENSUS         0.00092337      0.00102018       0.71363033       0.82   0.3675
 FACIL          0.01171496      0.00974167       1.25977812       1.45   0.2318
--------------------------------------------------------------------------------
Step 3   Variable CENSUS Removed    R-square = 0.53792463   C(p) =  4.19461013
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       5           108.32716704      21.66543341      24.91   0.0001
 Error          107            93.05265597       0.86965099
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.76804342      0.61022741       1.37763165       1.58   0.2109
 CULTURE        0.04318856      0.00984976      16.71979631      19.23   0.0001
 STAY           0.23392650      0.05741114      14.43814950      16.60   0.0001
 NRATIO         0.67240318      0.29931440       4.38883521       5.05   0.0267
 CHEST          0.00917860      0.00540681       2.50619510       2.88   0.0925
 FACIL          0.01843860      0.00629673       7.45710068       8.57   0.0042
--------------------------------------------------------------------------------
All variables left in the model are significant at the 0.1000 level.
   Summary of Backward Elimination Procedure for Dependent Variable RISK    
         Variable   Number   Partial     Model
 Step    Removed        In      R**2      R**2        C(p)           F   Prob>F
    1    BEDS            7    0.0007    0.5425      7.1612      0.1612   0.6888
    2    NURSES          6    0.0010    0.5415      5.3879      0.2284   0.6337
    3    CENSUS          5    0.0035    0.5379      4.1946      0.8192   0.3675
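The choice of BEDS as the first variable removed can be reproduced from the Step 0 table: each F is the Type II sum of squares divided by the full-model error mean square, and the variable with the smallest F (largest Prob>F) is the candidate for removal. Below is a small Python sketch of that arithmetic, using numbers copied from Step 0. The names are illustrative only.

```python
# Type II sums of squares and error mean square from Step 0 above.
MSE_FULL = 0.88457627
TYPE2_SS = {
    "CULTURE": 17.05081019, "STAY": 9.09797087, "NURSES": 0.14527948,
    "NRATIO": 4.13198139, "CHEST": 2.62501031, "BEDS": 0.14263677,
    "CENSUS": 0.60311345, "FACIL": 1.49138778,
}

# Partial F statistic for each variable, given all the others.
f_stats = {name: ss / MSE_FULL for name, ss in TYPE2_SS.items()}

# The removal candidate is the variable with the smallest partial F.
weakest = min(f_stats, key=f_stats.get)

# The same F is the square of the t statistic, estimate / standard error.
t_beds = -0.00106503 / 0.00265225
print(weakest, f_stats[weakest], t_beds ** 2)
```

BEDS has Prob>F = 0.6888, well above the 0.10 level quoted at the end of the output, so it is dropped. The models are then refitted and the comparison is repeated.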

STEPWISE Selection

data scenic;
 infile 'scenic.dat' firstobs=2;
 input Stay  Age Risk Culture Chest Beds 
     School Region Census Nurses Facil;
 Nratio = Nurses/Census;
proc reg  data=scenic;
  /* sle= is the significance level to enter,
     sls= is the significance level to stay */
  model Risk = Culture Stay Nurses Nratio 
     Chest Beds Census Facil / 
     selection=stepwise sle=0.20 sls=0.05;
run ;
EDITED SAS OUTPUT (Complete output)
               Stepwise Procedure for Dependent Variable RISK    
Step 1   Variable CULTURE Entered   R-square = 0.31265864   C(p) = 47.47794976
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       1            62.96314170      62.96314170      50.49   0.0001
 Error          111           138.41668131       1.24699713
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP       3.19789965      0.19376813     339.64905575     272.37   0.0001
 CULTURE        0.07325862      0.01030975      62.96314170      50.49   0.0001
--------------------------------------------------------------------------------
Step 2   Variable STAY Entered      R-square = 0.45040256   C(p) = 18.11960703
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       2            90.70198757      45.35099379      45.07   0.0001
 Error          110           110.67783543       1.00616214
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP       0.80549102      0.48775579       2.74400250       2.73   0.1015
 CULTURE        0.05645147      0.00979843      33.39687778      33.19   0.0001
 STAY           0.27547211      0.05246473      27.73884588      27.57   0.0001
--------------------------------------------------------------------------------
Step 3   Variable FACIL Entered     R-square = 0.49340010   C(p) = 10.33092385
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       3            99.36082444      33.12027481      35.39   0.0001
 Error          109           102.01899857       0.93595412
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP       0.49133226      0.48163614       0.97401801       1.04   0.3099
 CULTURE        0.05419997      0.00947933      30.59827862      32.69   0.0001
 STAY           0.22390748      0.05336561      16.47664606      17.60   0.0001
 FACIL          0.01963027      0.00645392       8.65883687       9.25   0.0029
--------------------------------------------------------------------------------
Step 4   Variable NRATIO Entered    R-square = 0.52547952   C(p) =  5.02782551
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       4           105.82097194      26.45524298      29.90   0.0001
 Error          108            95.55885107       0.88480418
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.49505513      0.59376426       0.61507231       0.70   0.4063
 CULTURE        0.04818092      0.00948204      22.84513509      25.82   0.0001
 STAY           0.26758404      0.05434637      21.44995791      24.24   0.0001
 NRATIO         0.79262357      0.29333869       6.46014750       7.30   0.0080
 FACIL          0.01747585      0.00632554       6.75349077       7.63   0.0067
--------------------------------------------------------------------------------
Step 5   Variable CHEST Entered     R-square = 0.53792463   C(p) =  4.19461013
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       5           108.32716704      21.66543341      24.91   0.0001
 Error          107            93.05265597       0.86965099
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.76804342      0.61022741       1.37763165       1.58   0.2109
 CULTURE        0.04318856      0.00984976      16.71979631      19.23   0.0001
 STAY           0.23392650      0.05741114      14.43814950      16.60   0.0001
 NRATIO         0.67240318      0.29931440       4.38883521       5.05   0.0267
 CHEST          0.00917860      0.00540681       2.50619510       2.88   0.0925
 FACIL          0.01843860      0.00629673       7.45710068       8.57   0.0042
--------------------------------------------------------------------------------
Step 6   Variable CHEST Removed     R-square = 0.52547952   C(p) =  5.02782551
                 DF         Sum of Squares      Mean Square          F   Prob>F
 Regression       4           105.82097194      26.45524298      29.90   0.0001
 Error          108            95.55885107       0.88480418
 Total          112           201.37982301
                 Parameter        Standard          Type II
 Variable         Estimate           Error   Sum of Squares          F   Prob>F
 INTERCEP      -0.49505513      0.59376426       0.61507231       0.70   0.4063
 CULTURE        0.04818092      0.00948204      22.84513509      25.82   0.0001
 STAY           0.26758404      0.05434637      21.44995791      24.24   0.0001
 NRATIO         0.79262357      0.29333869       6.46014750       7.30   0.0080
 FACIL          0.01747585      0.00632554       6.75349077       7.63   0.0067
--------------------------------------------------------------------------------
All variables left in the model are significant at the 0.0500 level.
The stepwise method terminated because the next variable to be entered was just 
removed.
         Summary of Stepwise Procedure for Dependent Variable RISK    
        Variable        Number   Partial    Model
 Step   Entered Removed     In      R**2     R**2      C(p)          F   Prob>F
    1   CULTURE              1    0.3127   0.3127   47.4779    50.4918   0.0001
    2   STAY                 2    0.1377   0.4504   18.1196    27.5690   0.0001
    3   FACIL                3    0.0430   0.4934   10.3309     9.2513   0.0029
    4   NRATIO               4    0.0321   0.5255    5.0278     7.3012   0.0080
    5   CHEST                5    0.0124   0.5379    4.1946     2.8818   0.0925
    6           CHEST        4    0.0124   0.5255    5.0278     2.8818   0.0925

Comments on code and results
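Steps 5 and 6 of the stepwise output show why CHEST enters and is immediately removed: its Prob>F of 0.0925 passes the entry level sle=0.20 but fails the stay level sls=0.05. The Python sketch below replays just that decision from the printed p-values. It is a simplified illustration, not SAS's implementation, and the function names are hypothetical.

```python
SLE = 0.20   # significance level to enter (from the MODEL statement)
SLS = 0.05   # significance level to stay  (from the MODEL statement)

# Prob>F for each variable in the five-variable model at Step 5.
STEP5_P = {"CULTURE": 0.0001, "STAY": 0.0001, "NRATIO": 0.0267,
           "CHEST": 0.0925, "FACIL": 0.0042}


def may_enter(p_value, sle=SLE):
    """A candidate may enter if its p-value is below the entry level."""
    return p_value < sle


def variable_to_drop(p_values, sls=SLS):
    """Return the least significant variable if it fails the stay level."""
    worst = max(p_values, key=p_values.get)
    return worst if p_values[worst] > sls else None


entered = may_enter(STEP5_P["CHEST"])
dropped = variable_to_drop(STEP5_P)
print(entered, dropped)
```

Because the best candidate for the next entry step would again be CHEST, the procedure stops, as the message at the end of the output says. Choosing sle larger than sls makes this kind of enter-then-remove step possible.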

ALL SUBSETS

data scenic;
 infile 'scenic.dat' firstobs=2;
 input Stay  Age Risk Culture Chest Beds 
     School Region Census Nurses Facil;
 Nratio = Nurses / Census  ;
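proc reg  data=scenic;
  /* selection=cp fits the subsets of predictors and lists
     them in increasing order of Mallows' C(p) */
  model Risk = Culture Stay Nurses Nratio  
   Chest Beds Census Facil / selection=cp ;
run ;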
EDITED SAS OUTPUT (Complete output)
N = 113     Regression Models for Dependent Variable: RISK    
      C(p)    R-square      Variables in Model
                        In   
   4.19461  0.53792463   5  CULTURE STAY NRATIO CHEST FACIL 
   4.81202  0.53521260   5  CULTURE STAY NRATIO CHEST CENSUS 
   5.02783  0.52547952   4  CULTURE STAY NRATIO FACIL 
   5.33543  0.53291351   5  CULTURE STAY NRATIO CHEST BEDS 
   5.38786  0.54146833   6  CULTURE STAY NRATIO CHEST CENSUS FACIL 
   5.69350  0.54012581   6  CULTURE STAY NRATIO CHEST BEDS FACIL 
   5.89630  0.53923499   6  CULTURE STAY NURSES NRATIO CHEST FACIL 
   6.00546  0.52118519   4  CULTURE STAY NRATIO CENSUS 
   6.23202  0.52897517   5  CULTURE STAY NURSES NRATIO CHEST 
   6.47628  0.51911707   4  CULTURE STAY NRATIO BEDS 
   6.50213  0.52778865   5  CULTURE STAY NRATIO CENSUS FACIL 
   6.70444  0.53568517   6  CULTURE STAY NURSES NRATIO CHEST CENSUS 
   6.73959  0.52674562   5  CULTURE STAY NRATIO BEDS FACIL 
   6.77459  0.53537702   6  CULTURE STAY NRATIO CHEST BEDS CENSUS 
   6.91746  0.52596429   5  CULTURE STAY NURSES NRATIO FACIL

   ...

  81.27048  0.17300751   2  BEDS FACIL 
  83.31964  0.15522130   1  NURSES 
  83.60929  0.17151925   3  NURSES BEDS CENSUS 
  84.59092  0.15842223   2  NURSES CENSUS 
  85.31844  0.15522654   2  NURSES BEDS 
  85.53858  0.14547441   1  CENSUS 
  86.28567  0.15097790   2  BEDS CENSUS 
  89.19019  0.12943445   1  BEDS 
 111.09898  0.03319840   1  NRATIO

Comments on code and results
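The C(p) column can be recomputed from the R-square column, assuming the usual definition of Mallows' statistic, C(p) = SSE_p / MSE_full - (n - 2p), where p counts the intercept plus the predictors in the model. The Python sketch below uses N = 113 from the listing and the full-model R-square 0.54317205 from Step 0 of the backward elimination output. The function name is illustrative.

```python
N = 113                # number of observations (N in the listing above)
R2_FULL = 0.54317205   # R-square with all 8 predictors
P_FULL = 9             # 8 slopes plus the intercept


def mallows_cp(r2, n_predictors, n=N, r2_full=R2_FULL, p_full=P_FULL):
    """Mallows' C(p) computed from R-square values alone.

    SSE_p / MSE_full equals (1 - R2_p) / (1 - R2_full) * (n - p_full),
    because the total sum of squares cancels in the ratio.
    """
    p = n_predictors + 1
    sse_over_mse_full = (1.0 - r2) / (1.0 - r2_full) * (n - p_full)
    return sse_over_mse_full - (n - 2 * p)


cp_best = mallows_cp(0.53792463, 5)    # CULTURE STAY NRATIO CHEST FACIL
cp_worst = mallows_cp(0.03319840, 1)   # NRATIO alone
cp_full = mallows_cp(R2_FULL, 8)       # all eight predictors
print(cp_best, cp_worst, cp_full)
```

For the full model C(p) always equals p (here 9), which is why Step 0 of the backward output reports C(p) = 9. Models with C(p) close to or below their own p are the ones usually considered adequate.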



Richard Lockhart
Mon Mar 10 11:03:54 PST 1997