STAT 350: Lecture 23

Goodness-of-fit: Pure Error Sum of Squares, An Example

DATA

0 0 1 61 34
0 0 1 63 16
15 0 2 67 36
15 0 2 69 19
...
30 50 9 74 48

SAS CODE

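  data plaster;
    infile 'plaster1.dat';
    input sand fibre combin hardness strength;
  run;
  /* Fit 1: hardness linear in sand and fibre (no CLASS statement). */
  proc glm data=plaster;
    model hardness = sand fibre;
  run;
  /* Fit 2: sand and fibre as factors with interaction,
     that is, one mean per sand-fibre combination. */
  proc glm data=plaster;
    class sand fibre;
    model hardness = sand | fibre;
  run;
  /* Fit 3: the same model as Fit 2, written using the
     combination label combin as a single factor. */
  proc glm data=plaster;
    class combin;
    model hardness = combin;
  run;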

EDITED OUTPUT

                            Sum of          Mean
Source DF        Squares        Square  F Value  Pr > F
Model   2   167.41666667   83.70833333   11.53   0.0009
Error  15   108.86111111    7.25740741
Total  17   276.27777778
_______________________________________________________
                            Sum of          Mean
Source DF        Squares        Square  F Value  Pr > F
Model   8   202.77777778   25.34722222     3.10  0.0557
Error   9    73.50000000    8.16666667
Total  17   276.27777777
_______________________________________________________
                           Sum of          Mean
Source DF        Squares        Square  F Value  Pr > F
Model   8   202.77777778   25.34722222    3.10   0.0557
Error   9    73.50000000    8.16666667
Total  17   276.27777778

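From the output we can put together a summary ANOVA table. The Pure Error line is the Error line of the model with one mean per sand-fibre combination. The Lack of Fit line is the difference between the Error line of the model linear in sand and fibre and the Pure Error line: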

Source             df        SS       MS      F     P
Model               2   167.417   83.708
Lack of Fit         6    35.361    5.894  0.722  0.64
Pure Error          9    73.500    8.167
Total (Corrected)  17   276.278
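
The arithmetic behind the Lack of Fit line can be reproduced directly from the two Error lines of the SAS output. The following is a minimal sketch in Python, not part of the original SAS analysis; the variable names are chosen for this illustration only. The P-value is not recomputed, since that needs an F distribution function.

```python
# Error sums of squares and degrees of freedom copied from the SAS output above.
sse_linear, df_linear = 108.86111111, 15  # Error line, model hardness = sand fibre
sse_pure, df_pure = 73.50000000, 9        # Error line, one mean per combination

# Lack of fit is what the linear model leaves unexplained beyond pure error.
ss_lof = sse_linear - sse_pure
df_lof = df_linear - df_pure
ms_lof = ss_lof / df_lof
ms_pure = sse_pure / df_pure
f_lof = ms_lof / ms_pure

print(f"Lack of Fit: SS={ss_lof:.3f} df={df_lof} MS={ms_lof:.3f} F={f_lof:.3f}")
```

The F statistic is referred to an F distribution with 6 and 9 degrees of freedom.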

Making an Added variable plot: example

[Figure: added variable plot (image not available).]

Categorical Covariates

Fitting models with categorical covariates

Suppose a categorical variable has $K$ levels. Relabel the data as $Y_{ij}$ where $j$ runs from 1 to $n_i$ and $i$ runs from 1 to $K$. Here $n_i$ is the number of observations with the categorical variable at level $i$. We fit the model

$$Y_{ij} = \alpha_i + x_{ij}^T \beta + \epsilon_{ij}$$

where now $\beta$ is the vector of slopes for, say, $p$ continuous covariates (with values $x_{ij}$) and $\alpha_i$ is the intercept, which depends on the level $i$ of the categorical variable.

This model does not have a column of 1's in the design matrix. It can be fitted by specifying /NOINT in SAS, for example. It is common, however, to reparametrize in such a way that the model has a column of 1's and the hypothesis of no effect of the factor, that is, $\alpha_1 = \cdots = \alpha_K$, is simply the hypothesis that the coefficients of some columns of the design matrix are 0. We usually do this by defining $\mu$ to be a weighted average of the intercepts, that is,

$$\mu = \frac{\sum_{i=1}^K n_i \alpha_i}{\sum_{i=1}^K n_i}$$

or by defining $\mu$ to be the intercept for level 1 of the factor, that is, $\mu = \alpha_1$. In either case we define some new parameters $\gamma_i = \alpha_i - \mu$. The model equation is now

$$Y_{ij} = \mu + \gamma_i + x_{ij}^T \beta + \epsilon_{ij}$$

Notice that in either case the $\gamma_i$ satisfy a linear restriction: either

$$\sum_{i=1}^K n_i \gamma_i = 0$$

or

$$\gamma_1 = 0$$

If we forget about this linear restriction then our linear reparametrization increases the number of columns of the design matrix by 1 without increasing the rank of $X$, so that the new $X^T X$ would be singular. SAS does the algebra without worrying about this by simply finding one of infinitely many possible solutions to the normal equations. I usually suggest the definition of $\mu$ as an average intercept. Then I eliminate $\gamma_K$ by writing

$$\gamma_K = -\frac{1}{n_K} \sum_{i=1}^{K-1} n_i \gamma_i$$

This changes the rows of the design matrix corresponding to observations at level $K$. The other definition of $\mu$, as $\alpha_1$, is called corner point coding, and the column of the design matrix corresponding to $\gamma_1$ is dropped.
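
To make the two codings concrete, here is a small Python sketch that builds the intercept and factor columns for a made-up factor with K = 3 levels. It is an illustration only: the function names and the example factor are hypothetical, and the average coding assumes that the weights in the weighted average are the group sizes.

```python
from fractions import Fraction

def corner_point_columns(levels, k):
    """Column of 1's plus indicators for levels 2..K (level 1 is the corner)."""
    return [[1] + [int(lev == i) for i in range(2, k + 1)] for lev in levels]

def average_coded_columns(levels, k):
    """Column of 1's plus K-1 effect columns; the level-K effect is eliminated.

    Assumes the average intercept is weighted by the group sizes, so rows at
    level K carry -n_i / n_K in the column for level i.
    """
    counts = {i: levels.count(i) for i in range(1, k + 1)}
    rows = []
    for lev in levels:
        if lev < k:
            effects = [Fraction(int(lev == i)) for i in range(1, k)]
        else:
            effects = [Fraction(-counts[i], counts[k]) for i in range(1, k)]
        rows.append([Fraction(1)] + effects)
    return rows

# Hypothetical factor with group sizes 3, 2 and 4.
levels = [1, 1, 1, 2, 2, 3, 3, 3, 3]
corner = corner_point_columns(levels, 3)
average = average_coded_columns(levels, 3)

# With group-size weights each effect column sums to zero.
effect_sums = [sum(row[j] for row in average) for j in (1, 2)]
print(effect_sums)
```

Only the rows at the last level differ from simple indicators in the average coding, which matches the remark that eliminating the last effect changes the rows for observations at level K.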

Example

Consider a small version of the car mileage example on assignment 3. Imagine we have only the 5 data points below.

        VEHICLE 1                  VEHICLE 2
Mileage   Emission Rate    Mileage   Emission Rate
      0              50          0              40
   1000              56       1100              49
   2000              58

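For the model equation

$$Y_{ij} = \alpha_i + \beta x_{ij} + \epsilon_{ij}$$

we have $K = 2$, $n_1 = 3$ and $n_2 = 2$. The $x_{ij}$ are the 5 numbers 0, 1000, 2000, 0, 1100. For this parametrization the design matrix is

$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1000 \\ 1 & 0 & 2000 \\ 0 & 1 & 0 \\ 0 & 1 & 1100 \end{bmatrix}$$

For the parametrization

$$Y_{ij} = \mu + \gamma_i + \beta x_{ij} + \epsilon_{ij}$$

the design matrix is simply that above with an extra column of 1's:

$$X = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1000 \\ 1 & 1 & 0 & 2000 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1100 \end{bmatrix}$$

Since columns 2 and 3 add together to give the first column, the matrix has rank 3 rather than 4 and $X^T X$ is singular.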
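
The rank deficiency can be checked numerically. The sketch below is a plain-Python illustration using exact fractions; the helper `rank` is written for this example and is not a library routine. It computes the rank of the design matrix with one intercept column per vehicle and of the same matrix with an extra column of 1's.

```python
from fractions import Fraction

def rank(matrix):
    """Rank by Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0  # number of pivots found so far
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

mileage = [0, 1000, 2000, 0, 1100]
vehicle = [1, 1, 1, 2, 2]

# One intercept column per vehicle plus the mileage column.
x_separate = [[int(v == 1), int(v == 2), m] for v, m in zip(vehicle, mileage)]
# The same matrix with an extra leading column of 1's.
x_extra = [[1] + row for row in x_separate]

print(rank(x_separate), rank(x_extra))
```

Both matrices have the same rank, so the extra column adds a parameter without adding any new direction to the column space.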

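If we define the parameters $\mu = (3\alpha_1 + 2\alpha_2)/5$, $\gamma_1 = \alpha_1 - \mu$ and $\gamma_2 = \alpha_2 - \mu$ then $3\gamma_1 + 2\gamma_2 = 0$, that is, $\gamma_2 = -3\gamma_1/2$. As a result we can write the model equations as

$$Y_{1j} = \mu + \gamma_1 + \beta x_{1j} + \epsilon_{1j}$$

and

$$Y_{2j} = \mu - \tfrac{3}{2}\gamma_1 + \beta x_{2j} + \epsilon_{2j}$$

and then the design matrix is

$$X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 1000 \\ 1 & 1 & 2000 \\ 1 & -3/2 & 0 \\ 1 & -3/2 & 1100 \end{bmatrix}$$

Alternatively corner point coding leads to the design matrix

$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1000 \\ 1 & 0 & 2000 \\ 1 & 1 & 0 \\ 1 & 1 & 1100 \end{bmatrix}$$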

All these design matrices have the same column space, so they must lead to the same fitted values, the same residuals and the same error sum of squares. The hypothesis of no "Vehicle" effect, that is, that the two cars have the same intercept, is tested either by a t-test on the parameter which is the difference of intercepts or by an extra sum of squares F-test comparing with the restricted model in which just one straight line is fitted.

One important point is that in all the parametrizations the parameter "difference of intercepts" has the same estimate. This is true even for the design matrix with the extra column of 1's, for which $X^T X$ is singular.
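
The agreement between parametrizations can be checked on the five data points. The following Python sketch is an illustration under stated assumptions, not the SAS computation: the helpers `solve`, `least_squares` and `fitted` are written for this example, the average coding assumes the weights are the group sizes 3 and 2, and the singular parametrization is left out because its normal equations have no unique solution.

```python
from fractions import Fraction as F

def solve(a, b):
    """Solve the square system a x = b exactly by Gauss-Jordan elimination."""
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = next(i for i in range(c, n) if m[i][c] != 0)  # pivot row
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for i in range(n):
            if i != c:
                factor = m[i][c]
                m[i] = [vi - factor * vc for vi, vc in zip(m[i], m[c])]
    return [row[n] for row in m]

def least_squares(x, y):
    """Solve the normal equations X'X b = X'y for a full-rank X."""
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)

def fitted(x, coef):
    """Fitted values X b."""
    return [sum(c * v for c, v in zip(coef, row)) for row in x]

mileage = [0, 1000, 2000, 0, 1100]
emission = [50, 56, 58, 40, 49]
vehicle = [1, 1, 1, 2, 2]
pairs = list(zip(vehicle, mileage))

# Separate intercepts, corner point coding, and (assumed) size-weighted average coding.
x_separate = [[F(int(v == 1)), F(int(v == 2)), F(m)] for v, m in pairs]
x_corner = [[F(1), F(int(v == 2)), F(m)] for v, m in pairs]
x_average = [[F(1), F(1) if v == 1 else F(-3, 2), F(m)] for v, m in pairs]

b_separate = least_squares(x_separate, emission)
b_corner = least_squares(x_corner, emission)
b_average = least_squares(x_average, emission)

# Difference of intercepts (vehicle 2 minus vehicle 1) in each parametrization.
diff_separate = b_separate[1] - b_separate[0]
diff_corner = b_corner[1]
diff_average = F(-3, 2) * b_average[1] - b_average[1]

same_fit = (fitted(x_separate, b_separate) == fitted(x_corner, b_corner)
            == fitted(x_average, b_average))
print(float(diff_separate), float(diff_corner), float(diff_average), same_fit)
```

Exact fractions are used so that the comparison of estimates and fitted values is not affected by rounding.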

Factors with more than two levels

Let us now examine what happens if we add two categorical variables, SCHOOL and REGION, to our model using SAS.

SAS CODE

options pagesize=60 linesize=80;
data scenic;
  infile 'scenic.dat' firstobs=2;
  input Stay Age Risk Culture Chest Beds
        School Region Census Nurses Facil;
  Nratio = Nurses / Census;
run;
/* Fit 1: four continuous covariates plus both factors. */
proc glm data=scenic;
  class School Region;
  model Risk = Culture Stay Nurses
        Nratio School Region;
run;
/* Fit 2: Nratio dropped. */
proc glm data=scenic;
  class School Region;
  model Risk = Culture Stay
        Nurses School Region;
run;
/* Fit 3: Nratio and School dropped. */
proc glm data=scenic;
  class School Region;
  model Risk = Culture Stay Nurses Region;
run;

EDITED OUTPUT

                           Class    Levels    Values
                           SCHOOL        2    1 2
                           REGION        4    1 2 3 4
Dependent Variable: RISK   
                     Sum of            Mean
Source     DF       Squares          Square   F Value     Pr > F
Model       8  110.94402256     13.86800282     15.95     0.0001
Error     104   90.43580045      0.86957500
Total     112  201.37982301
     R-Square      C.V.        Root MSE            RISK Mean
     0.550919   21.41305       0.9325101            4.3548673
Source     DF     Type I SS     Mean Square   F Value     Pr > F
CULTURE     1   62.96314170     62.96314170     72.41     0.0001
STAY        1   27.73884588     27.73884588     31.90     0.0001
NURSES      1    7.01369438      7.01369438      8.07     0.0054
NRATIO      1    5.97484076      5.97484076      6.87     0.0101
SCHOOL      1    1.24877748      1.24877748      1.44     0.2335
REGION      3    6.00472236      2.00157412      2.30     0.0815
Source     DF   Type III SS     Mean Square   F Value     Pr > F
CULTURE     1   27.43863928     27.43863928     31.55     0.0001
STAY        1   26.44898274     26.44898274     30.42     0.0001
NURSES      1    6.39021516      6.39021516      7.35     0.0079
NRATIO      1    1.74482880      1.74482880      2.01     0.1596
SCHOOL      1    2.21945688      2.21945688      2.55     0.1132
REGION      3    6.00472236      2.00157412      2.30     0.0815
________________________________________________________________
                     Sum of            Mean
Source     DF       Squares          Square   F Value     Pr > F
Model       7  109.19919376     15.59988482     17.77     0.0001
Error     105   92.18062925      0.87791075
Total     112  201.37982301
     R-Square      C.V.        Root MSE            RISK Mean
     0.542255   21.51544       0.9369689            4.3548673
Source     DF     Type I SS     Mean Square   F Value     Pr > F
CULTURE     1   62.96314170     62.96314170     71.72     0.0001
STAY        1   27.73884588     27.73884588     31.60     0.0001
NURSES      1    7.01369438      7.01369438      7.99     0.0056
SCHOOL      1    2.16544259      2.16544259      2.47     0.1193
REGION      3    9.31806922      3.10602307      3.54     0.0173
Source     DF   Type III SS     Mean Square   F Value     Pr > F
CULTURE     1   32.63679640     32.63679640     37.18     0.0001
STAY        1   24.70628794     24.70628794     28.14     0.0001
NURSES      1    8.99075614      8.99075614     10.24     0.0018
SCHOOL      1    3.19583271      3.19583271      3.64     0.0591
REGION      3    9.31806922      3.10602307      3.54     0.0173
________________________________________________________________
                     Sum of            Mean
Source     DF       Squares          Square   F Value     Pr > F
Model       6  106.00336105     17.66722684     19.64     0.0001
Error     106   95.37646196      0.89977794
Corrected Total     112     201.37982301
        R-Square    C.V.        Root MSE            RISK Mean
       .526385   21.78175       0.9485663            4.3548673
Source     DF     Type I SS     Mean Square   F Value     Pr > F
CULTURE     1   62.96314170     62.96314170     69.98     0.0001
STAY        1   27.73884588     27.73884588     30.83     0.0001
NURSES      1    7.01369438      7.01369438      7.79     0.0062
REGION      3    8.28767910      2.76255970      3.07     0.0310
Source     DF   Type III SS     Mean Square   F Value     Pr > F
CULTURE     1   30.50324858     30.50324858     33.90     0.0001
STAY        1   22.98974524     22.98974524     25.55     0.0001
NURSES      1    5.85040582      5.85040582      6.50     0.0122
REGION      3    8.28767910      2.76255970      3.07     0.0310
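
The three fits are nested, so the Error lines above can be combined into extra sum of squares F statistics. The sketch below is a plain-Python illustration with names chosen for this example; it recomputes the F statistic for dropping NRATIO from the first fit and for dropping SCHOOL from the second fit, using only numbers copied from the output.

```python
# (error sum of squares, error df) copied from the three Error lines above.
fit_full = (90.43580045, 104)       # Culture Stay Nurses Nratio School Region
fit_no_nratio = (92.18062925, 105)  # Nratio dropped
fit_no_school = (95.37646196, 106)  # Nratio and School dropped

def extra_ss_f(reduced, full):
    """Extra sum of squares F statistic for a reduced model nested in a full one."""
    sse_reduced, df_reduced = reduced
    sse_full, df_full = full
    extra_ms = (sse_reduced - sse_full) / (df_reduced - df_full)
    return extra_ms / (sse_full / df_full)

f_nratio = extra_ss_f(fit_no_nratio, fit_full)
f_school = extra_ss_f(fit_no_school, fit_no_nratio)
print(round(f_nratio, 2), round(f_school, 2))
```

These agree with the Type III F values reported for NRATIO in the first fit (2.01) and for SCHOOL in the second fit (3.64).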

CONCLUSIONS



Richard Lockhart
Fri Feb 28 10:04:26 PST 1997