STAT 350: Lecture 21

Diagnostics

In addition to the residual plots already discussed, there are a number of formal statistical procedures for diagnosing problems with the fitted model.

Problems with individual data points

SCENIC data example

I use SAS to fit the final selected model; the covariates are STAY, CULTURE, NURSES and NURSE.RATIO (called Nratio in the code).

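options pagesize=60 linesize=80;
/* Read the SCENIC data, starting at the second line of the file */
data scenic;
 infile 'scenic.dat' firstobs=2;
 input Stay  Age Risk Culture Chest Beds School Region Census Nurses Facil;
 Nratio = Nurses / Census  ;   /* nurses per unit of census */
run;
/* Fit the model and save fitted values and case diagnostics in scout */
proc glm  data=scenic;
  model Risk = Culture Stay Nurses Nratio ;
  output out=scout P=Fitted PRESS=PRESS H=HAT RSTUDENT=EXTST R=RESID DFFITS=DFFITS COOKD=COOKD;
run ;
/* List every observation together with its diagnostics */
proc print data=scout;
run;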

The complete SAS output is available separately.

Here is a plot of the leverages against the observation number. (The text calls a plot in which one variable is the observation number an "index" plot.)

Observations 4, 8, 47, 54 and 112 have leverages over 0.15. Many more exceed the suggested cutoff of 2p/n = 10/113 (about 0.088, with p = 5 coefficients and n = 113 hospitals), but I prefer to plot the leverages and look at the largest few. Observations 4 and 47, in particular, have leverages over 0.3 and should be looked at.
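The 10/113 cutoff is twice the average leverage: the leverages always sum to p, the number of fitted coefficients (5 here, counting the intercept), so their mean is p/n. The following is a minimal Python sketch of that arithmetic. The one-covariate example uses made-up x values (not SCENIC data) so that the leverages have a simple closed form, and the function names are hypothetical.

```python
def leverage_cutoff(n, p):
    """Rule-of-thumb cutoff: twice the mean leverage p/n."""
    return 2.0 * p / n


def simple_regression_leverages(x):
    """Leverages h_i = 1/n + (x_i - xbar)^2 / Sxx for a straight-line fit."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


# Cutoff for the model in these notes: p = 5 coefficients, n = 113 hospitals.
scenic_cutoff = leverage_cutoff(113, 5)

# Hypothetical covariate values (NOT from the SCENIC data); the last is far out.
x = [8.0, 9.0, 9.5, 10.0, 10.5, 11.0, 19.5]
h = simple_regression_leverages(x)
toy_cutoff = leverage_cutoff(len(x), 2)  # intercept + slope
flagged = [i + 1 for i, hi in enumerate(h) if hi > toy_cutoff]

print(f"SCENIC cutoff 2p/n = {scenic_cutoff:.4f}")
print("toy leverages:", [round(hi, 3) for hi in h])
print("sum of toy leverages:", round(sum(h), 6))  # equals p = 2
print("flagged observations (1-based):", flagged)
```

In the toy data only the outlying last point is flagged. With several covariates the leverages come from the diagonal of the hat matrix instead, which is what H=HAT requests in the SAS code.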

Now I look at influence measures.

COOK'S DISTANCE

In this plot observations 8, 11, 54 and 112 have values of Cook's distance $D_i$ larger than 0.05. Of these, only observation 11 is new. The text recommends worrying only about observations for which $D_i$ is larger than the tenth to twentieth percentile of the $F_{p,\,n-p}$ distribution (here p = 5 and n - p = 108). In this case those critical points are 0.3? and 0.46. None of the observations exceeds even the lower of these numbers.
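The two critical points can be checked numerically. The sketch below is only an illustration, using the Python standard library: it integrates the F density with Simpson's rule and inverts the result by bisection, assuming 5 and 108 degrees of freedom (the Model-plus-intercept and Error degrees of freedom of the full-data fit). The helper names are hypothetical; a statistics package would normally supply the quantile function directly.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0  # correct at x = 0 when d1 > 2, as it is here
    log_c = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2)
             - math.lgamma(d2 / 2) + (d1 / 2) * math.log(d1 / d2))
    return math.exp(log_c + (d1 / 2 - 1) * math.log(x)
                    - ((d1 + d2) / 2) * math.log1p(d1 * x / d2))


def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x) by composite Simpson's rule on [0, x] (steps must be even)."""
    if x <= 0.0:
        return 0.0
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f_pdf(k * h, d1, d2)
    return total * h / 3


def f_quantile(prob, d1, d2, lo=0.0, hi=50.0):
    """Invert the CDF by bisection on [lo, hi]."""
    for _ in range(50):
        mid = (lo + hi) / 2
        if f_cdf(mid, d1, d2) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


p, n = 5, 113  # coefficients (with intercept) and observations
q10 = f_quantile(0.10, p, n - p)
q20 = f_quantile(0.20, p, n - p)
print(f"10th percentile of F({p},{n - p}): {q10:.3f}")
print(f"20th percentile of F({p},{n - p}): {q20:.3f}")
```

Both values should land close to the critical points quoted above, far above the Cook's distances of about 0.05 seen in the plot.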

DFFITS

Finally case deleted residuals:

Notice that only observation 53 is added for our consideration, though with 113 residuals a value of 2.9 is not terribly unusual.
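To see why 2.9 is not alarming among 113 residuals, one can compute how often a t variable exceeds 2.9 in absolute value and multiply by 113. The sketch below assumes the plotted quantity is the externally studentized residual (RSTUDENT in the SAS code), which has n - p - 1 = 107 degrees of freedom. It treats the residuals as independent, which is only an approximation, and the function names are hypothetical.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - ((df + 1) / 2) * math.log1p(t * t / df))


def two_sided_tail(t, df, steps=2000):
    """P(|T| > t), using Simpson's rule for the area on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = 2 * total * h / 3  # P(-t <= T <= t), by symmetry
    return 1.0 - central


n, p = 113, 5
df = n - p - 1  # 107: each residual comes from a fit that omits that case
tail = two_sided_tail(2.9, df)
expected_count = n * tail                 # expected number of |residuals| > 2.9
prob_at_least_one = 1 - (1 - tail) ** n   # independence is an approximation

print(f"P(|T| > 2.9), df = {df}: {tail:.4f}")
print(f"expected count among {n} residuals: {expected_count:.2f}")
print(f"approx. chance of at least one: {prob_at_least_one:.2f}")
```

The per-residual tail probability is under half a percent, but across 113 residuals the expected count is a sizeable fraction of one, so a single value near 2.9 is unremarkable.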

Here are the covariate values for observations 4, 8, 11, 47, 53, 54 and 112:

Observation  Culture   Stay  Nurses  Nratio  Risk
          4     18.9   8.95     148    2.79   5.6
          8     60.5  11.18     360    0.90   5.4
         11     28.5  11.07     656    1.11   4.9
         47     17.2  19.56     172    0.63   6.5
         53     16.6  11.41     273    0.83   7.6
         54     52.4  12.07      76    0.66   7.8
        112     26.4  17.94     407    0.51   5.9
       Mean     15.8   9.65     173    0.95
         SD     10.2   1.91     139    0.11

Observation 4 has a quite unusual value of Nurse.Ratio (a lot of nurses relative to its census) and observation 47 has quite a high average Stay for patients. The others are harder to interpret, but 4 and 47 are the most highly leveraged observations. In summary, it appears that several observations exert excess influence on the fitting process. As a final check on whether the fit was unduly influenced by these observations, I fit the model again in SAS after removing observations 4, 8 and 47:

                                     Sum of            Mean
Source                  DF          Squares          Square   F Value     Pr > F
Model                    4     100.46168102     25.11542026     28.21     0.0001
Error                  105      93.49504625      0.89042901
Corrected Total        109     193.95672727
                  R-Square             C.V.        Root MSE            RISK Mean
                  0.517959         21.87080       0.9436255            4.3145455
                                        T for H0:    Pr > |T|   Std Error of
Parameter                  Estimate    Parameter=0                Estimate
INTERCEPT              -.1511778299          -0.21     0.8349     0.72370376
CULTURE                0.0568635139           5.28     0.0001     0.01077276
STAY                   0.2773500736           4.18     0.0001     0.06629165
NURSES                 0.0016666813           2.30     0.0232     0.00072362
NRATIO                 0.7024480620           1.92     0.0578     0.36620665

Compare these results with the corresponding output from the same code applied to the full data set:

Dependent Variable: RISK   
                                     Sum of            Mean
Source                  DF          Squares          Square   F Value     Pr > F
Model                    4     103.69052272     25.92263068     28.66     0.0001
Error                  108      97.68930029      0.90453056
Corrected Total        112     201.37982301
                  R-Square             C.V.        Root MSE            RISK Mean
                  0.514900         21.83920       0.9510681            4.3548673
                                        T for H0:    Pr > |T|   Std Error of
Parameter                  Estimate    Parameter=0                Estimate
INTERCEPT              -.0831378994          -0.14     0.8917     0.60917500
CULTURE                0.0482485831           5.03     0.0001     0.00959016
STAY                   0.2767441333           5.04     0.0001     0.05489077
NURSES                 0.0015865156           2.26     0.0258     0.00070177
NRATIO                 0.7694874096           2.57     0.0115     0.29939874

SUMMARY

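The differences seem minor: every coefficient estimate changes by less than one standard error, and R-Square and Root MSE are nearly unchanged (0.518 versus 0.515, and 0.944 versus 0.951). The most visible change is that the P-value for NRATIO rises from 0.0115 to 0.0578. There is therefore little harm in just sticking to the model fitted at the start of these notes.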
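One way to put "minor" on a scale is to express each change in estimate in units of a standard error. The following Python sketch does this with the numbers copied from the two SAS tables above. Using the full-data standard error as the yardstick is an assumption of this sketch, not something the notes prescribe.

```python
# (estimate, standard error) from the full-data fit, n = 113
full = {
    "INTERCEPT": (-0.0831378994, 0.60917500),
    "CULTURE": (0.0482485831, 0.00959016),
    "STAY": (0.2767441333, 0.05489077),
    "NURSES": (0.0015865156, 0.00070177),
    "NRATIO": (0.7694874096, 0.29939874),
}

# (estimate, standard error) after removing observations 4, 8 and 47, n = 110
reduced = {
    "INTERCEPT": (-0.1511778299, 0.72370376),
    "CULTURE": (0.0568635139, 0.01077276),
    "STAY": (0.2773500736, 0.06629165),
    "NURSES": (0.0016666813, 0.00072362),
    "NRATIO": (0.7024480620, 0.36620665),
}

# Change in each estimate, measured in full-data standard errors.
shift_in_se = {}
for name, (est_full, se_full) in full.items():
    est_reduced, _ = reduced[name]
    shift_in_se[name] = (est_reduced - est_full) / se_full

for name, shift in shift_in_se.items():
    print(f"{name:10s} {shift:+.2f} SE")

largest = max(shift_in_se, key=lambda k: abs(shift_in_se[k]))
print("largest shift:", largest)
```

The CULTURE coefficient moves the most, by a little under one standard error; the others barely move on this scale.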


Richard Lockhart
Fri Feb 28 09:46:53 PST 1997