STAT 350: Lecture 22
Goodness-of-fit: Pure Error Sum of Squares
If, for each (or at least sufficiently many) combination of covariates
in a data set, there are several observations, we can carry out an
extra sum of squares F-test to see if our regression model is adequate.
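Suppose that $x_1, \ldots, x_K$ are the distinct rows of the design matrix and suppose we have $n_1$ observations for which the covariate values are those in $x_1$, $n_2$ observations with covariate pattern $x_2$ and so on. Of course $n_1 + \cdots + n_K = n$. We compare our final fitted model with a so-called saturated model by an extra sum of squares F-test. To be precise we let $\mu_1$ be the mean value of $Y$ when the covariate pattern is $x_1$, $\mu_2$ the mean corresponding to $x_2$ and so on. Relabel the $n$ data points as $Y_{i,j}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, K$, and fit a one way ANOVA model
$$Y_{i,j} = \mu_i + \epsilon_{i,j}$$
to the $Y_{i,j}$. The error sum of squares for this FULL model is
$$\mathrm{ESS}_{\mathrm{FULL}} = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \left(Y_{i,j} - \bar{Y}_{i\cdot}\right)^2 ,$$
where $\bar{Y}_{i\cdot}$ is the average of the $n_i$ observations with covariate pattern $x_i$.
This ESS is called the pure error sum of squares because we have not assumed any particular relation between the mean of $Y$ and the covariate vector $x$. We form the F statistic for testing the overall quality of our model by computing the "lack of fit SS" as
$$\mathrm{SS}_{\text{lack of fit}} = \mathrm{ESS}_{\mathrm{Restricted}} - \mathrm{ESS}_{\mathrm{FULL}}$$
where the restricted model is the final model whose fit we are checking.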
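For reference, the test statistic built from the lack of fit SS can be written out as below. This is the usual extra sum of squares form, not a formula quoted from the lecture: $K$ is the number of distinct covariate patterns, $n$ the number of observations, and $p$ (notation introduced here) the number of regression parameters in the restricted model. The stated null distribution assumes independent normal errors with constant variance.

$$F = \frac{\left(\mathrm{ESS}_{\mathrm{Restricted}} - \mathrm{ESS}_{\mathrm{FULL}}\right)/(K - p)}{\mathrm{ESS}_{\mathrm{FULL}}/(n - K)} \sim F_{K-p,\; n-K} \quad \text{if the restricted model is adequate.}$$

Large values of $F$ indicate that the restricted model fits noticeably worse than the saturated model, that is, lack of fit.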
As an example, return to the plaster hardness data of Lecture 12. There are 9 different covariate patterns, corresponding to all possible combinations of the 3 levels of SAND and the 3 levels of FIBRE, with 2 observations at each pattern. There are two ways to compute the pure error sum of squares: create a new variable with 9 levels which labels the 9 categories, or fit a two way ANOVA with interactions:
DATA
| sand | fibre | combin | hardness | strength |
| 0 | 0 | 1 | 61 | 34 |
| 0 | 0 | 1 | 63 | 16 |
| 15 | 0 | 2 | 67 | 36 |
| 15 | 0 | 2 | 69 | 19 |
| 30 | 0 | 3 | 65 | 28 |
| 30 | 0 | 3 | 74 | 17 |
| 0 | 25 | 4 | 69 | 49 |
| 0 | 25 | 4 | 69 | 48 |
| 15 | 25 | 5 | 69 | 43 |
| 15 | 25 | 5 | 74 | 29 |
| 30 | 25 | 6 | 74 | 31 |
| 30 | 25 | 6 | 72 | 24 |
| 0 | 50 | 7 | 67 | 55 |
| 0 | 50 | 7 | 69 | 60 |
| 15 | 50 | 8 | 69 | 45 |
| 15 | 50 | 8 | 74 | 43 |
| 30 | 50 | 9 | 74 | 22 |
| 30 | 50 | 9 | 74 | 48 |
SAS CODE
options pagesize=60 linesize=80;
data plaster;
  infile 'plaster1.dat';
  input sand fibre combin hardness strength;
proc glm data=plaster;
  model hardness = sand fibre;
run;
proc glm data=plaster;
  class sand fibre;
  model hardness = sand | fibre;
run;
proc glm data=plaster;
  class combin;
  model hardness = combin;
run;
EDITED OUTPUT (one ANOVA table per PROC GLM step, in the order the steps are run above)
                               Sum of           Mean
Source             DF         Squares         Square   F Value   Pr > F
Model               2    167.41666667    83.70833333     11.53   0.0009
Error              15    108.86111111     7.25740741
Corrected Total    17    276.27777778
________________________________________________________________________________
                               Sum of           Mean
Source             DF         Squares         Square   F Value   Pr > F
Model               8    202.77777778    25.34722222      3.10   0.0557
Error               9     73.50000000     8.16666667
Corrected Total    17    276.27777778
________________________________________________________________________________
                               Sum of           Mean
Source             DF         Squares         Square   F Value   Pr > F
Model               8    202.77777778    25.34722222      3.10   0.0557
Error               9     73.50000000     8.16666667
Corrected Total    17    276.27777778
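The Error line of the last two tables is the pure error sum of squares, and it can be checked by hand from the data. The following is a minimal Python sketch, not part of the original SAS analysis; it uses only the `combin` and `hardness` columns copied from the DATA listing, and the variable names are illustrative.

```python
from collections import defaultdict

# (combin, hardness) pairs copied from the DATA listing.
data = [
    (1, 61), (1, 63), (2, 67), (2, 69), (3, 65), (3, 74),
    (4, 69), (4, 69), (5, 69), (5, 74), (6, 74), (6, 72),
    (7, 67), (7, 69), (8, 69), (8, 74), (9, 74), (9, 74),
]

# Collect the responses that share each covariate pattern.
groups = defaultdict(list)
for combin, hardness in data:
    groups[combin].append(hardness)

# Pure error SS: squared deviations from each pattern's own mean.
pure_error_ss = 0.0
for values in groups.values():
    mean = sum(values) / len(values)
    pure_error_ss += sum((y - mean) ** 2 for y in values)

# Degrees of freedom: n observations minus K distinct patterns.
pure_error_df = len(data) - len(groups)

print(pure_error_ss, pure_error_df)  # prints 73.5 9
```

No regression model enters this calculation: only replicate observations at the same covariate pattern contribute, which is why the result is called pure error.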
From the output we can put together a summary ANOVA table:
| Source | df | SS | MS | F | P |
| Model | 2 | 167.417 | 83.708 | | |
| Lack of Fit | 6 | 35.361 | 5.894 | 0.722 | 0.64 |
| Pure Error | 9 | 73.500 | 8.167 | | |
| Total (Corrected) | 17 | 276.278 | | | |
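The Lack of Fit row can be reproduced from the two Error lines of the SAS output. The sketch below is a hedged illustration using only the Python standard library: the helper names `f_density` and `upper_tail` are hypothetical, and the P-value is approximated by Simpson's rule applied to the F density rather than taken from a statistics package.

```python
import math

# Error SS and df copied from the SAS output.
ess_restricted, df_restricted = 108.86111111, 15  # model hardness = sand fibre
ess_full, df_full = 73.5, 9                       # saturated model (pure error)

ss_lof = ess_restricted - ess_full   # lack of fit SS
df_lof = df_restricted - df_full     # lack of fit df
f_stat = (ss_lof / df_lof) / (ess_full / df_full)


def f_density(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom.

    Evaluating at x = 0 is safe only when d1 >= 2, which holds here (d1 = 6).
    """
    beta = math.gamma(d1 / 2) * math.gamma(d2 / 2) / math.gamma((d1 + d2) / 2)
    return ((d1 / d2) ** (d1 / 2) * x ** (d1 / 2 - 1)
            * (1 + d1 * x / d2) ** (-(d1 + d2) / 2) / beta)


def upper_tail(f, d1, d2, steps=2000):
    """Approximate P(F > f) by Simpson's rule on [0, f]; steps must be even."""
    h = f / steps
    total = f_density(0.0, d1, d2) + f_density(f, d1, d2)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f_density(k * h, d1, d2)
    return 1.0 - total * h / 3


p_value = upper_tail(f_stat, df_lof, df_full)

# Rounded values agree with the table: 35.361, 6, 0.722, 0.64
print(round(ss_lof, 3), df_lof, round(f_stat, 3), round(p_value, 2))
```

With a P-value of about 0.64 there is no evidence of lack of fit: the model that is linear and additive in SAND and FIBRE fits hardness about as well as the saturated nine-mean model.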