
STAT 350: Lecture 22

Goodness-of-fit: Pure Error Sum of Squares

If, for each (or at least sufficiently many) combination of covariates in a data set, there are several observations, we can carry out an extra sum of squares F-test to see if our regression model is adequate. Suppose that $x_1, \ldots, x_K$ are the distinct rows of the design matrix and suppose we have $n_1$ observations for which the covariate values are those in $x_1$, $n_2$ observations with covariate pattern $x_2$ and so on. Of course $n_1 + \cdots + n_K = n$. We compare our final fitted model with a so-called saturated model by an extra sum of squares F-test. To be precise we let $\mu_1$ be the mean value of Y when the covariate pattern is $x_1$, $\mu_2$ the mean corresponding to $x_2$ and so on. Relabel the n data points as $Y_{i,j}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, K$, and fit a one way ANOVA model to the $Y_{i,j}$. The error sum of squares for this FULL model is

$$\mathrm{ESS}_{\mathrm{FULL}} = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \left( Y_{i,j} - \bar{Y}_{i\cdot} \right)^2$$

This ESS is called the pure error sum of squares because we have not assumed any particular relation between the mean of Y and the covariate vector x. We form the F statistic for testing the overall quality of our model by computing the "lack of fit SS" as

$$\mathrm{SS}_{\text{lack of fit}} = \mathrm{ESS}_{\mathrm{Restricted}} - \mathrm{ESS}_{\mathrm{FULL}}$$

where the restricted model is the final model whose fit we are checking.
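Written out, the lack of fit test uses the usual extra sum of squares ratio. The display below is a sketch under assumed notation: p (the number of regression parameters in the restricted model, intercept included), K (the number of distinct covariate patterns) and n (the total number of observations) are labels chosen here for illustration.

```latex
F = \frac{\left( \mathrm{ESS}_{\mathrm{Restricted}} - \mathrm{ESS}_{\text{Pure Error}} \right) / (K - p)}
         {\mathrm{ESS}_{\text{Pure Error}} / (n - K)}
```

If the restricted model is adequate and the errors satisfy the usual normal theory assumptions, this statistic is referred to the F distribution with K - p and n - K degrees of freedom.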

As an example, return to the plaster hardness data of Lecture 12. There are 9 different covariate patterns, corresponding to all possible combinations of the 3 levels of SAND and the 3 levels of FIBRE. There are two ways to compute the pure error sum of squares: create a new variable with 9 levels which labels the 9 categories, or fit a two way ANOVA with interactions:

DATA

0 0 1 61 34
0 0 1 63 16
15 0 2 67 36
15 0 2 69 19
30 0 3 65 28
30 0 3 74 17
0 25 4 69 49
0 25 4 69 48
15 25 5 69 43
15 25 5 74 29
30 25 6 74 31
30 25 6 72 24
0 50 7 67 55
0 50 7 69 60
15 50 8 69 45
15 50 8 74 43
30 50 9 74 22
30 50 9 74 48
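
The pure error sum of squares can also be checked by hand from this listing, since it is just the within-pattern variation in hardness added up over the 9 patterns. The Python sketch below is not part of the original SAS analysis. The names ROWS and pure_error_ss are illustrative, and the column order (sand, fibre, combin, hardness, strength) is taken from the SAS input statement.

```python
from collections import defaultdict

# Rows copied from the DATA listing: (sand, fibre, combin, hardness, strength).
ROWS = [
    (0, 0, 1, 61, 34), (0, 0, 1, 63, 16),
    (15, 0, 2, 67, 36), (15, 0, 2, 69, 19),
    (30, 0, 3, 65, 28), (30, 0, 3, 74, 17),
    (0, 25, 4, 69, 49), (0, 25, 4, 69, 48),
    (15, 25, 5, 69, 43), (15, 25, 5, 74, 29),
    (30, 25, 6, 74, 31), (30, 25, 6, 72, 24),
    (0, 50, 7, 67, 55), (0, 50, 7, 69, 60),
    (15, 50, 8, 69, 45), (15, 50, 8, 74, 43),
    (30, 50, 9, 74, 22), (30, 50, 9, 74, 48),
]


def pure_error_ss(rows):
    """Return (pure error SS, error df) for hardness, grouping by combin."""
    groups = defaultdict(list)
    for _sand, _fibre, combin, hardness, _strength in rows:
        groups[combin].append(hardness)
    total = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        # Squared deviations of each replicate from its own pattern mean.
        total += sum((y - mean) ** 2 for y in values)
    # Error df = number of observations minus number of distinct patterns.
    return total, len(rows) - len(groups)


ss, df = pure_error_ss(ROWS)
print(f"pure error SS = {ss:.3f} on {df} df")  # pure error SS = 73.500 on 9 df
```

Grouping on combin is equivalent to grouping on the (sand, fibre) pair, which is why the two saturated fits give the same error line.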

SAS CODE

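  options pagesize=60 linesize=80;
  /* Read the data; combin labels the 9 sand-by-fibre combinations. */
  data plaster;
  infile 'plaster1.dat';
  input sand fibre combin hardness strength;
  /* Restricted model: hardness linear in sand and fibre. */
  proc glm  data=plaster;
   model hardness = sand fibre;
  run;
  /* Saturated model, first way: two way ANOVA with interaction. */
  proc glm  data=plaster;
   class sand fibre;
   model hardness = sand | fibre ;
  run;
  /* Saturated model, second way: one way ANOVA on the 9-level factor. */
  proc glm  data=plaster;
   class combin;
   model hardness = combin;
  run;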

EDITED OUTPUT

                                     Sum of            Mean
Source                  DF          Squares          Square   F Value     Pr > F
Model                    2     167.41666667     83.70833333     11.53     0.0009
Error                   15     108.86111111      7.25740741
Corrected Total         17     276.27777778
________________________________________________________________________________
                                     Sum of            Mean
Source                  DF          Squares          Square   F Value     Pr > F
Model                    8     202.77777778     25.34722222      3.10     0.0557
Error                    9      73.50000000      8.16666667
Corrected Total         17     276.27777778
________________________________________________________________________________
                                     Sum of            Mean
Source                  DF          Squares          Square   F Value     Pr > F
Model                    8     202.77777778     25.34722222      3.10     0.0557
Error                    9      73.50000000      8.16666667
Corrected Total         17     276.27777778

From the output we can put together a summary ANOVA table:

Source               df         SS        MS        F       P
Model                 2    167.417    83.708
Lack of Fit           6     35.361     5.894    0.722    0.64
Pure Error            9     73.500     8.167
Total (Corrected)    17    276.278
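
The Lack of Fit line is obtained by subtraction from the two Error lines in the SAS output. A minimal Python sketch of that arithmetic follows. The function name lack_of_fit and the constant names are illustrative, and the numbers are the Error sums of squares and degrees of freedom printed by SAS for the restricted and saturated fits.

```python
# Error lines read off the SAS output: (sum of squares, degrees of freedom).
ESS_RESTRICTED, DF_RESTRICTED = 108.86111111, 15  # model hardness = sand fibre
ESS_PURE, DF_PURE = 73.50000000, 9                # saturated 9-cell model


def lack_of_fit(ess_restricted, df_restricted, ess_pure, df_pure):
    """Return (SS, df, MS, F) for the lack of fit line of the ANOVA table."""
    ss = ess_restricted - ess_pure
    df = df_restricted - df_pure
    ms = ss / df
    # F compares the lack of fit mean square with the pure error mean square.
    f = ms / (ess_pure / df_pure)
    return ss, df, ms, f


ss, df, ms, f = lack_of_fit(ESS_RESTRICTED, DF_RESTRICTED, ESS_PURE, DF_PURE)
print(f"Lack of fit: SS = {ss:.3f}, df = {df}, MS = {ms:.3f}, F = {f:.3f}")
# Lack of fit: SS = 35.361, df = 6, MS = 5.894, F = 0.722
```

The sketch does not compute the P value. The 0.64 in the table refers F = 0.722 to the F distribution with 6 and 9 degrees of freedom, and a value this large gives no evidence of lack of fit in the two-covariate linear model.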




Richard Lockhart
Fri Feb 28 09:53:38 PST 1997