Reading:
See Lecture 18 for plots of the data. Here we fit several models and discuss interpretation of the coefficients.
I begin by thinking about how the variables should influence Risk. It seems natural that risk of infection would increase with length of stay. Of course this data provides only ecological correlations, that is, average risk for a whole group of patients is being related to average stay. Nothing in the data can indicate clearly whether or not patients with long stays are actually getting infected more often. Nevertheless, it seems reasonable that if we could hold other features of the hospital environment constant then hospitals with long average stays would expose their patients to more risk, that is, would have a higher probability of infection. Similar remarks should be made about all the discussion below.
The variable Age would be expected to be positively correlated with risk of infection since, presumably, older patients are more susceptible.
The two variables Chest and Culture are measures of how hard the hospital looks for otherwise unsuspected infection. If you look harder you should find more so that the coefficients of these variables in any regression model would be expected to be positive.
The variables Beds, Census and Nurses all measure the size of the hospital. It is not clear if a big hospital should put a patient at risk of infection or not. However, I point out that the relations between these three variables might make a difference. A hospital with high Census compared to Beds may be overcrowded and so more likely to support infections. A hospital with lots of Nurses relative to patients might be expected to be kept in better condition and so lower the risk.
The variable Facilities seems to measure the sophistication of medical treatment at the hospital. This might be positive or it might suggest more exotic diseases among the patient population so I can't guess intelligently about the direction of the effect.
We begin by fitting a model using all the continuous predictors, that is ignoring only SCHOOL and REGION. Here is Splus code and results.
> fit.full <- lm(Risk ~ Stay + Age + Culture + Chest + Beds + Census +
Nurses + Facilities, data=scenic)
> summary(fit.full)
Call: lm(formula = Risk ~ Stay + Age + Culture + Chest + Beds + Census +
Nurses + Facilities, data = scenic)
Residuals:
Min 1Q Median 3Q Max
-2.222 -0.5918 0.01833 0.5453 2.556
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -0.7473 1.2076 -0.6188 0.5374
Stay 0.1769 0.0691 2.5621 0.0118
Age 0.0162 0.0223 0.7285 0.4679
Culture 0.0470 0.0107 4.3719 0.0000
Chest 0.0120 0.0055 2.1943 0.0304
Beds -0.0014 0.0027 -0.5340 0.5945
Census 0.0007 0.0035 0.2097 0.8343
Nurses 0.0019 0.0018 1.0873 0.2794
Facilities 0.0163 0.0102 1.5978 0.1131
Residual standard error: 0.959 on 104 degrees of freedom
Multiple R-Squared: 0.5251
F-statistic: 14.37 on 8 and 104 degrees of freedom, the p-value is 6.117e-14
Correlation of Coefficients:
(Intercept) Stay Age Culture Chest Beds Census Nurses
Stay -0.0146
Age -0.8622 -0.3347
Culture -0.1606 -0.2930 0.3042
Chest -0.1914 -0.3039 0.0281 -0.2911
Beds -0.0377 0.2726 -0.0775 -0.0551 0.0077
Census 0.0536 -0.4625 0.1369 0.1530 0.0657 -0.8766
Nurses 0.0041 0.2486 -0.0410 -0.1935 -0.0657 -0.1681 -0.2265
Facilities -0.1544 -0.0475 -0.0232 -0.0363 -0.0613 -0.2009 0.0670 -0.2229
The output suggests that Stay, Culture and Chest are important predictors
but none of the others are. So we fit next the model retaining only these
3 predictors.
> fit.1 <- lm( Risk ~ Stay + Culture + Chest, data=scenic)
> summary(fit.1)
Call: lm(formula = Risk ~ Stay + Culture + Chest, data = scenic)
Residuals:
Min 1Q Median 3Q Max
-2.181 -0.7678 -0.04002 0.696 2.594
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.3092 0.5425 0.5700 0.5698
Stay 0.2450 0.0540 4.5349 0.0000
Culture 0.0494 0.0103 4.8008 0.0000
Chest 0.0110 0.0056 1.9839 0.0498
Residual standard error: 0.99 on 109 degrees of freedom
Multiple R-Squared: 0.4696
F-statistic: 32.16 on 3 and 109 degrees of freedom, the p-value is 5.662e-15
Correlation of Coefficients:
(Intercept) Stay Culture
Stay -0.6633
Culture 0.1766 -0.1963
Chest -0.4611 -0.2848 -0.3435
We can compare these models in Splus by carrying out an extra Sum of
Squares F-test. I have edited the output to make it fit - changing
the columns labelled Terms to the shorthand FULL and REDUCED
> anova(fit.1,fit.full)
Analysis of Variance Table
Response: Risk
Terms Resid. Df RSS Test Df Sum of Sq F Value Pr(F)
1 REDUCED 109 106.8206
2 FULL 104 95.6398 5 11.18079 2.431627 0.03972748
You see that the test of the hypothesis of no influence of the 5 extra
variables suggests that you cannot delete them all, even though individual
t-tests for the 5 individual coefficients are not significant. To see
what can happen consider adding each of the three variables which measure
size
> fit.b <- lm(Risk ~ Stay + Culture + Chest + Beds , data = scenic)
> fit.c <- lm(Risk ~ Stay + Culture + Chest + Census, data = scenic)
> fit.n <- lm(Risk ~ Stay + Culture + Chest + Nurses, data = scenic)
> summary(fit.b)
Call: lm(formula = Risk ~ Stay + Culture + Chest + Beds, data = scenic)
Residuals:
Min 1Q Median 3Q Max
-1.993 -0.7365 0.05695 0.66 2.291
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.4149 0.5309 0.7816 0.4362
Stay 0.1845 0.0578 3.1940 0.0018
Culture 0.0480 0.0101 4.7701 0.0000
Chest 0.0130 0.0055 2.3761 0.0193
Beds 0.0013 0.0005 2.5516 0.0121
Residual standard error: 0.9658 on 108 degrees of freedom
Multiple R-Squared: 0.4997
F-statistic: 26.97 on 4 and 108 degrees of freedom, the p-value is
1.554e-15
Correlation of Coefficients:
(Intercept) Stay Culture Chest
Stay -0.6351
Culture 0.1714 -0.1558
Chest -0.4439 -0.3155 -0.3475
Beds 0.0780 -0.4099 -0.0560 0.1424
> summary(fit.c)
Call: lm(formula = Risk ~ Stay + Culture + Chest + Census, data = scenic)
Residuals:
Min 1Q Median 3Q Max
-1.984 -0.7584 0.07387 0.6545 2.447
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.5233 0.5357 0.9770 0.3307
Stay 0.1719 0.0599 2.8694 0.0049
Culture 0.0484 0.0101 4.8199 0.0000
Chest 0.0132 0.0055 2.3952 0.0183
Census 0.0017 0.0007 2.5656 0.0117
Residual standard error: 0.9655 on 108 degrees of freedom
Multiple R-Squared: 0.5
F-statistic: 27 on 4 and 108 degrees of freedom, the p-value is 1.554e-15
Correlation of Coefficients:
(Intercept) Stay Culture Chest
Stay -0.6504
Culture 0.1683 -0.1542
Chest -0.4270 -0.3189 -0.3452
Census 0.1558 -0.4757 -0.0384 0.1497
> summary(fit.n)
Call: lm(formula = Risk ~ Stay + Culture + Chest + Nurses, data = scenic)
Residuals:
Min 1Q Median 3Q Max
-1.958 -0.7093 0.02961 0.5473 2.453
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.3703 0.5241 0.7065 0.4814
Stay 0.1936 0.0549 3.5263 0.0006
Culture 0.0456 0.0100 4.5505 0.0000
Chest 0.0127 0.0054 2.3480 0.0207
Nurses 0.0021 0.0007 2.9949 0.0034
Residual standard error: 0.9556 on 108 degrees of freedom
Multiple R-Squared: 0.5102
F-statistic: 28.13 on 4 and 108 degrees of freedom, the p-value is 5.551e-16
Correlation of Coefficients:
(Intercept) Stay Culture Chest
Stay -0.6417
Culture 0.1700 -0.1450
Chest -0.4544 -0.3008 -0.3519
Nurses 0.0389 -0.3126 -0.1276 0.1013
You should note that each of the measures of size is significant; that is, it appears that you should add each of them. But because the three measures of size are quite multicolinear you get no additional predictive power out of including more than one of them in the model. Look what happens when I add both Census and Beds:
> fit.bc <- lm(Risk ~ Stay + Culture + Chest + Census + Beds, data = scenic)
> summary(fit.bc)
Call: lm(formula = Risk ~ Stay + Culture + Chest + Census + Beds, data = scenic)
Residuals:
Min 1Q Median 3Q Max
-1.986 -0.7498 0.07771 0.6563 2.386
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.4841 0.5742 0.8431 0.4010
Stay 0.1760 0.0637 2.7611 0.0068
Culture 0.0483 0.0101 4.7610 0.0000
Chest 0.0131 0.0055 2.3797 0.0191
Census 0.0011 0.0034 0.3243 0.7463
Beds 0.0005 0.0026 0.1956 0.8453
Residual standard error: 0.9699 on 107 degrees of freedom
Multiple R-Squared: 0.5002
F-statistic: 21.42 on 5 and 107 degrees of freedom, the p-value is 8.549e-15
Correlation of Coefficients:
(Intercept) Stay Culture Chest Census
Stay -0.6906
Culture 0.1887 -0.1749
Chest -0.3926 -0.3080 -0.3417
Census 0.3715 -0.4140 0.0810 0.0513
Beds -0.3492 0.3301 -0.0906 -0.0215 -0.9794
Notice again that neither the coefficient of Beds nor that of Census is
significantly different from 0. The reason for the confusion is in the
matrix of correlations (which is
(XTX)-1 converted to correlation
form by dividing the ijth entry by the square root of the product of
the ith and jth diagonal entries). In particular the Here is Splus code for an ANOVA table comparing the model with Beds and Census to the model without them.
> anova(fit.1,fit.bc)
Analysis of Variance Table
Response: Risk
Terms Resid. Df RSS Test Df Sum of Sq F Value Pr(F)
1 Stay + Culture + Chest 109 106.8206
2 Stay + Culture + Chest + Census + Beds 107 100.6482 +Census+Beds 2 6.17243 3.280984 0.04140679
Finally look what happens if we put in Beds and Nurses.
> fit.nb <- lm(Risk ~ Stay + Culture + Chest + Nurses + Beds, data = scenic)
> summary(fit.nb)
Call: lm(formula = Risk ~ Stay + Culture + Chest + Nurses + Beds, data = scenic)
Residuals:
Min 1Q Median 3Q Max
-1.959 -0.7159 0.05094 0.5114 2.513
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.3523 0.5290 0.6660 0.5069
Stay 0.1998 0.0582 3.4312 0.0009
Culture 0.0451 0.0102 4.4383 0.0000
Chest 0.0125 0.0055 2.2807 0.0245
Nurses 0.0026 0.0017 1.5528 0.1234
Beds -0.0004 0.0012 -0.3336 0.7394
Residual standard error: 0.9596 on 107 degrees of freedom
Multiple R-Squared: 0.5107
F-statistic: 22.34 on 5 and 107 degrees of freedom, the p-value is 2.776e-15
Correlation of Coefficients:
(Intercept) Stay Culture Chest Nurses
Stay -0.6371
Culture 0.1819 -0.1818
Chest -0.4364 -0.3217 -0.3285
Nurses -0.0763 0.1693 -0.1821 -0.0678
Beds 0.1019 -0.3230 0.1422 0.1211 -0.9079
Notice particularly that Beds now has a negative coefficient (though
not significantly so).
SUMMARY: Because this is an observational study you cannot interpret the parameter estimates as measuring the amount by which the response would be expected to increase if you increased the corresponding variable by 1 unit. The trouble is that in the population, large values of Beds go with, usually, larger values of Census. So the coefficient of Beds measures something rather more like: how much would the response increase if I increased Beds by 1 unit and all the other covariates changed as you would expect from the relations in the variables seen in this population. Thus you can't really base a policy decision about building more beds in a hospital on the sign of the coefficient of Beds in a regression model like this.