STAT 350: 95-3
Assignment 5
Part A
Write
and
Then
Since
and
simply omit the ith rows we see
We have
Again omitting the ith term we find
and give a formula for the scalar r.
Simply multiply
The trick most students missed is that in the last term
the quantity
is a scalar, that is, just a single number
(check that it is a
matrix). It may help to give it a
name like s to see that the last term is
Thus
which is the identity matrix if
Solving gives
. Many students
got a formula like
which is a
matrix not a scalar.
and give a formula for
in terms of the leverage
.
Put
and
. You get
and
We have
Multiply
by
to get
The variance of the PRESS residual is
.
The externally studentized residual is the PRESS residual divided
by an estimate of its standard error where
is
estimated by
. Thus the externally studentized residual
is
To get the simplest formula we must actually simplify
.
The middle term vanishes because
Use
to get
and then you can deduce the final formula given in class.
Here is code for all the methods and with all subsets done both
using
and using adjusted
.
data nit; infile 'nit.dat' ; input nitexc weight dryin wetin nitin ; proc reg data=nit; model nitexc = weight dryin wetin nitin /selection=FORWARD; run ; proc reg data=nit; model nitexc = weight dryin wetin nitin /selection=BACKWARD; run ; proc reg data=nit; model nitexc = weight dryin wetin nitin /selection=STEPWISE; run ; proc reg data=nit; model nitexc = weight dryin wetin nitin /selection=CP; run ; proc reg data=nit; model nitexc = weight dryin wetin nitin /selection=ADJRSQ; run ;
The conclusion of the
output
is that BACKWARD and STEPWISE settle
for the model containing only Nitrogen Intake as a predictor. The
forward selection method also includes Wet Intake because of the very high
level of
(0.5) to enter. The all subsets method using
would settle on the model using only nitrogen intake but the adjusted
method also includes Wet Intake. However, overall there seems
little reason to include Wet Intake since it improves the fit very
little and is not very significant at all.
Here is code for the analysis of variance.
options pagesize=60 linesize=80; data electron; infile 'anova.dat' firstobs=2; input Time Sex Sequence Exper Replic; proc glm data=electron; class Sex Sequence Exper ; model time = Sex|Sequence|Exper ; means sex sequence exper sex*sequence*exper; estimate 'sexdif' sex 1 -1; estimate 'seq12dif' sequence 1 -1 0; estimate 'seq13dif' sequence 1 0 -1; estimate 'seq23dif' sequence 0 1 -1; estimate 'expdif' exper 1 -1; output out=anovres r=resid; proc rank data=anovres normal=blom out=ressc; var resid; ranks nscores; proc corr data=ressc; var resid nscores; run;COMMENTS
You can read the type III sums of squares table to do F tests without doing multiple runs because each effect has a sum of squares which is unaffected by the presence of the others. The conclusion is that the three way interaction is insignificant and all three two way interactions are insignificant. All three main effects are significant; none can be eliminated.
In the question on Bonferroni confidence intervals the 5 quantities to be estimated are each estimated by a difference of two averages so that the variance of the estimated difference is of the form
You work out standard t type confidence intervals by estimating
the means as usual (see the means statment) and replacing
by the MSE in the formula for the variance. The
Bonferroni method just replaces the
of 0.01 by
for 5 confidence intervals. The t critical
value is
. For the sex difference for instance the
average time for men is 1155.933 seconds while for women it is 966.1333.
The difference is 189.8 and this is computed by the means statement whose
output is
T for H0: Pr > |T| Std Error of Parameter Estimate Parameter=0 Estimate sexdif 189.800000 25.10 0.0001 7.56325180 seq12dif -57.250000 -6.18 0.0001 9.26305385 seq13dif 6.600000 0.71 0.4796 9.26305385 seq23dif 63.850000 6.89 0.0001 9.26305385 expdif 159.666667 21.11 0.0001 7.56325180
Notice the standard errors. The mean squared error in the model is 858.06. There
are 30 men and 30 women so the variance of the difference between men's average
and women's average is 858.06(1/30+1/30) = 57.204. The standard error of the estimate
of
is then
as in the
output. The desired confidence intervals are all and estimate from the column
labelled Estimate plus or minus 2.41 times a standard error from the last column.
The needed mean for
is produced by means. The key point is that
the standard error to attach to the mean is
which is based
on 48 degrees of freedom, not on 4.
For this question I used the SAS code
data knees; infile 'knees.dat' firstobs=2; input Days Fitness Patient Age ; proc glm data=knees; class Fitness ; model Days = Fitness|Age ; proc glm data=knees; class Fitness ; model Days = Fitness Age ; output out=anovres r=resid p=fitted; proc rank data=anovres normal=blom out=ressc; var resid; ranks nscores; proc corr data=ressc; var resid nscores; proc print data=ressc; var fitted fitness age resid nscores; run;getting the output ( complete output )
General Linear Models Procedure
Dependent Variable: DAYS
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 5 1082.0560870 216.4112174 655.36 0.0001
Error 18 5.9439130 0.3302174
Correctd Totl 23 1088.0000000
R-Square C.V. Root MSE DAYS Mean
0.994537 1.795767 0.5746454 32.000000
Source DF Type III SS Mean Square F Value Pr > F
FITNESS 2 5.44989183 2.72494592 8.25 0.0029
AGE 1 369.44147783 369.44147783 1118.78 0.0001
AGE*FITNESS 2 0.22183487 0.11091744 0.34 0.7191
General Linear Models Procedure
Dependent Variable: DAYS
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 3 1081.8342521 360.6114174 1169.72 0.0001
Error 20 6.1657479 0.3082874
Correctd Totl 23 1088.0000000
R-Square C.V. Root MSE DAYS Mean
0.994333 1.735114 0.5552363 32.000000
Source DF Type III SS Mean Square F Value Pr > F
FITNESS 2 246.08370505 123.04185252 399.11 0.0001
AGE 1 409.83425209 409.83425209 1329.39 0.0001
Correlation Analysis
Pearson Correlation Coefficients
RESID
NSCORES 0.99488
OBS FITTED FITNESS AGE RESID NSCORES
1 28.7930 1 18.3 0.20697 0.26136
2 42.4503 1 30.0 -0.45028 -0.87524
3 38.3648 1 26.5 -0.36478 -0.60318
4 40.2324 1 28.1 -0.23244 -0.48332
5 42.1001 1 29.7 0.89991 1.94690
6 39.8822 1 27.8 0.11775 0.05171
7 30.5440 1 19.8 -0.54396 -1.03865
8 41.6332 1 29.3 0.36682 0.73241
9 29.8639 2 20.8 0.13613 0.15568
10 34.9999 2 25.2 0.00007 -0.05171
11 39.6691 2 29.2 -0.66907 -1.23590
12 28.9300 2 20.0 -0.93004 -1.49843
13 30.6810 2 21.5 0.31903 0.60318
14 31.3813 2 22.1 -0.38134 -0.73241
15 28.5799 2 19.7 0.42015 0.87524
16 34.4163 2 24.7 0.58372 1.03865
17 29.1635 2 20.2 -0.16349 -0.26136
18 32.3152 2 22.9 0.68483 1.23590
19 25.2062 3 22.7 0.79380 1.49843
20 32.2099 3 28.7 -0.20991 -0.37006
21 20.7705 3 18.9 0.22949 0.37006
22 19.7199 3 18.0 0.28005 0.48332
23 24.0389 3 21.7 -1.03891 -1.94690
24 22.0545 3 20.0 -0.05452 -0.15568
For part a of 25.11 the residuals are printed out above. For part b the plots desired are:
For part c, the generalized model is
and the null hypothesis is that all the
are equal. This is tested by
comparing the two model statements
model days = fitness | age and model days = fitness age, doing an
extra sum of squares F test. The resulting F statistic is the
Type III sum of squares for fitness*age giving F=0.34 and P=0.7191.
The null hypothesis is accepted.
For 25.12 part c the F statistic is obtained from the Type III sum of squares for fitness for the model days = fitness age statement. This has F=399.11 and P=0.0001 There is clearly an effect of the variable Fitness.
The estimate statements permit us to compare the three levels of fitness. SAS prints out estimates of the differences in the intercepts. The relevant output is
T for H0: Pr > |T| Std Error of Parameter Estimate Parameter=0 Estimate High v Low Fitness -8.72289277 -26.20 0.0001 0.33296397 High v Med Fitness -6.87551411 -23.84 0.0001 0.28837673 Med v Low Fitness -1.84737866 -6.44 0.0001 0.28694289Notice that the fit group appears to recuperate about 9 days faster than the unfit group!