STAT 350: Lecture 24
Categorical covariates
or
Example
Consider a small version of the car mileage example on assignment 3. Imagine we have only the 5 data points below.
| VEHICLE 1 | | VEHICLE 2 | |
| Mileage | Emission Rate | Mileage | Emission Rate |
| 0 | 50 | 0 | 40 |
| 1000 | 56 | 1100 | 49 |
| 2000 | 58 | | |
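For the model equation
$$Y_{ij} = \alpha_i + \beta x_{ij} + \epsilon_{ij}$$
we have $i = 1, 2$, with $j = 1, 2, 3$ for Vehicle 1 and $j = 1, 2$ for Vehicle 2. The $x_{ij}$ are the 5 numbers 0, 1000, 2000, 0, 1100. For this parametrization the design matrix is
$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1000 \\ 1 & 0 & 2000 \\ 0 & 1 & 0 \\ 0 & 1 & 1100 \end{bmatrix}$$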
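For the parametrization
$$Y_{ij} = \mu + \alpha_i + \beta x_{ij} + \epsilon_{ij}$$
the design matrix is simply that above with an extra column of 1's:
$$X = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1000 \\ 1 & 1 & 0 & 2000 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1100 \end{bmatrix}$$
Since columns 2 and 3 add together to give the first column, the matrix has 4 columns but only rank 3, and $X^TX$ is singular.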
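The rank claim can be checked numerically. The following is a minimal sketch, assuming a common slope with one intercept per vehicle (columns: indicator of Vehicle 1, indicator of Vehicle 2, mileage). The helper names `matrix_rank` and `gram` are illustrative, not from the notes.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a small matrix by Gaussian elimination in exact arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below position `rank` with a nonzero entry here.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def gram(X):
    """Return X^T X as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

mileage = [0, 1000, 2000, 0, 1100]
vehicle = [1, 1, 1, 2, 2]

# Assumed column order: vehicle 1 indicator, vehicle 2 indicator, mileage.
X_means = [[int(v == 1), int(v == 2), x] for v, x in zip(vehicle, mileage)]
# The same matrix with an extra leading column of 1's.
X_over = [[1] + row for row in X_means]

rank_means = matrix_rank(X_means)
rank_over = matrix_rank(X_over)
rank_gram_over = matrix_rank(gram(X_over))
print(rank_means, rank_over, rank_gram_over)  # prints: 3 3 3
```

The 4 by 4 matrix $X^TX$ for the second parametrization has rank 3, so it has no inverse.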
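If we define the parameters $\bar\alpha = (\alpha_1+\alpha_2)/2$, $\delta_1 = \alpha_1 - \bar\alpha$ and $\delta_2 = \alpha_2 - \bar\alpha$, where $\alpha_i$ is the intercept for vehicle $i$, then $\delta_1 + \delta_2 = 0$. As a result we can write the model equations as
$$Y_{1j} = \bar\alpha + \delta_1 + \beta x_{1j} + \epsilon_{1j}$$
and
$$Y_{2j} = \bar\alpha - \delta_1 + \beta x_{2j} + \epsilon_{2j}$$
and then the design matrix is
$$X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 1000 \\ 1 & 1 & 2000 \\ 1 & -1 & 0 \\ 1 & -1 & 1100 \end{bmatrix}$$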
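Alternatively, corner point coding, taking Vehicle 1 as the baseline level for instance, leads to the design matrix
$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1000 \\ 1 & 0 & 2000 \\ 1 & 1 & 0 \\ 1 & 1 & 1100 \end{bmatrix}$$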
All these design matrices have the same column space, so they must lead to the same fitted values, the same residuals and the same error sum of squares. The hypothesis of no "Vehicle" effect, that is, that the two cars have the same intercept, is tested either by a t-test on the parameter which is the difference of intercepts or by an extra sum of squares F-test comparing with the restricted model in which just 1 straight line is fitted.
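As a rough check of this paragraph, the sketch below fits the 5 data points under three codings and compares the results. It assumes a common slope and three codings: one indicator per vehicle, a +1/-1 coding, and corner point coding with Vehicle 1 as baseline. The helpers `solve`, `gram` and `least_squares` are illustrative names, and exact rational arithmetic is used so that equalities hold exactly.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square, nonsingular system A z = b exactly (Gauss-Jordan)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [a - factor * v for a, v in zip(M[r], M[c])]
    return [row[n] for row in M]

def gram(X):
    """Return X^T X as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def least_squares(X, y):
    """Return (coefficients, fitted values, error SS) for a full-rank X."""
    Xty = [sum(a * yi for a, yi in zip(col, y)) for col in zip(*X)]
    coef = solve(gram(X), Xty)
    fitted = [sum(c * x for c, x in zip(coef, row)) for row in X]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return coef, fitted, sse

mileage = [0, 1000, 2000, 0, 1100]
emission = [50, 56, 58, 40, 49]
vehicle = [1, 1, 1, 2, 2]
data = list(zip(vehicle, mileage))

X_means = [[int(v == 1), int(v == 2), x] for v, x in data]  # two intercepts
X_sum = [[1, 1 if v == 1 else -1, x] for v, x in data]      # +1/-1 coding
X_corner = [[1, int(v == 2), x] for v, x in data]           # vehicle 1 baseline
X_line = [[1, x] for v, x in data]                          # restricted: one line

coef_m, fit_m, sse_m = least_squares(X_means, emission)
coef_s, fit_s, sse_s = least_squares(X_sum, emission)
coef_c, fit_c, sse_c = least_squares(X_corner, emission)
_, _, sse_line = least_squares(X_line, emission)

# Extra sum of squares F statistic on (1, 5 - 3) degrees of freedom.
mse = sse_c / 2
F = (sse_line - sse_c) / 1 / mse

# Squared t statistic for the difference of intercepts in the corner coding:
# its variance factor is the second diagonal entry of (X^T X)^{-1}.
inv_column = solve(gram(X_corner), [0, 1, 0])
t_squared = coef_c[1] ** 2 / (mse * inv_column[1])

print("same fitted values:", fit_m == fit_s == fit_c)
print("same error SS:", sse_m == sse_s == sse_c)
print("F =", float(F), " t^2 =", float(t_squared))
```

The squared t statistic equals the F statistic, which is why the two tests of "no Vehicle effect" agree.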
One important point is that in all the parametrizations the
parameter "difference of intercepts" has the same estimate.
This is true even for the overparametrized design matrix $X$ with the extra column of 1's, for which $X^TX$ is singular.
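To illustrate the last remark: when $X^TX$ is singular the normal equations have many solutions, but they all give the same difference of intercepts. The sketch below assumes the overparametrized model has columns (1, indicator of Vehicle 1, indicator of Vehicle 2, mileage). It builds two different solutions by fixing the overall constant at an arbitrary value; `solution_with_mu` and the values 0 and 10 are purely illustrative.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square, nonsingular system A z = b exactly (Gauss-Jordan)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [a - factor * v for a, v in zip(M[r], M[c])]
    return [row[n] for row in M]

mileage = [0, 1000, 2000, 0, 1100]
emission = [50, 56, 58, 40, 49]
vehicle = [1, 1, 1, 2, 2]

X_means = [[int(v == 1), int(v == 2), x] for v, x in zip(vehicle, mileage)]
X_over = [[1] + row for row in X_means]  # extra column of 1's, rank 3

def solution_with_mu(mu):
    """One solution (mu, a1, a2, slope) of the singular normal equations."""
    shifted = [yi - mu for yi in emission]
    cols = list(zip(*X_means))
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(a * yi for a, yi in zip(ci, shifted)) for ci in cols]
    return [Fraction(mu)] + solve(XtX, Xty)

def satisfies_normal_equations(b):
    """Check X^T (y - X b) = 0 for the overparametrized design matrix."""
    resid = [yi - sum(bj * xj for bj, xj in zip(b, row))
             for yi, row in zip(emission, X_over)]
    return all(sum(c * r for c, r in zip(col, resid)) == 0
               for col in zip(*X_over))

sol_a = solution_with_mu(0)
sol_b = solution_with_mu(10)
diff_a = sol_a[1] - sol_a[2]
diff_b = sol_b[1] - sol_b[2]
print(satisfies_normal_equations(sol_a), satisfies_normal_equations(sol_b))
print(sol_a != sol_b, diff_a == diff_b)
```

Both vectors solve the normal equations and differ from each other, yet the difference of the two intercept parameters is identical.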
Factors with more than two levels
Let us now examine what happens if we add two categorical variables, SCHOOL and REGION, to our model using SAS.
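A factor with four levels needs three indicator columns beside the intercept, which is why REGION carries 3 degrees of freedom in the output further on. The sketch below shows corner point coding for such a factor. The region codes are made up for illustration, the choice of level 4 as baseline is an assumption, and the function `corner_point_columns` is hypothetical. It is not meant to describe how SAS codes the factor internally.

```python
def corner_point_columns(values, levels, baseline):
    """Indicator columns for every level except the baseline level."""
    kept = [lv for lv in levels if lv != baseline]
    return [[int(v == lv) for lv in kept] for v in values]

# Made-up region codes for six hypothetical hospitals (illustration only).
region = [1, 2, 3, 4, 2, 4]
rows = corner_point_columns(region, levels=[1, 2, 3, 4], baseline=4)
for code, row in zip(region, rows):
    print(code, row)
```

Observations in the baseline level get a row of zeros, so each coefficient measures the difference between one level and the baseline.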
SAS CODE
options pagesize=60 linesize=80;
data scenic;
  infile 'scenic.dat' firstobs=2;
  input Stay Age Risk Culture Chest Beds School Region Census Nurses Facil;
  Nratio = Nurses / Census;
proc glm data=scenic;
  class School Region;
  model Risk = Culture Stay Nurses Nratio School Region;
run;
proc glm data=scenic;
  class School Region;
  model Risk = Culture Stay Nurses School Region;
run;
proc glm data=scenic;
  class School Region;
  model Risk = Culture Stay Nurses Region;
run;

EDITED OUTPUT
| Class | Levels | Values |
| SCHOOL | 2 | 1 2 |
| REGION | 4 | 1 2 3 4 |

Dependent Variable: RISK

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| Model | 8 | 110.94402256 | 13.86800282 | 15.95 | 0.0001 |
| Error | 104 | 90.43580045 | 0.86957500 | | |
| Total | 112 | 201.37982301 | | | |

| R-Square | C.V. | Root MSE | RISK Mean |
| 0.550919 | 21.41305 | 0.9325101 | 4.3548673 |

| Source | DF | Type I SS | Mean Square | F Value | Pr > F |
| CULTURE | 1 | 62.96314170 | 62.96314170 | 72.41 | 0.0001 |
| STAY | 1 | 27.73884588 | 27.73884588 | 31.90 | 0.0001 |
| NURSES | 1 | 7.01369438 | 7.01369438 | 8.07 | 0.0054 |
| NRATIO | 1 | 5.97484076 | 5.97484076 | 6.87 | 0.0101 |
| SCHOOL | 1 | 1.24877748 | 1.24877748 | 1.44 | 0.2335 |
| REGION | 3 | 6.00472236 | 2.00157412 | 2.30 | 0.0815 |

| Source | DF | Type III SS | Mean Square | F Value | Pr > F |
| CULTURE | 1 | 27.43863928 | 27.43863928 | 31.55 | 0.0001 |
| STAY | 1 | 26.44898274 | 26.44898274 | 30.42 | 0.0001 |
| NURSES | 1 | 6.39021516 | 6.39021516 | 7.35 | 0.0079 |
| NRATIO | 1 | 1.74482880 | 1.74482880 | 2.01 | 0.1596 |
| SCHOOL | 1 | 2.21945688 | 2.21945688 | 2.55 | 0.1132 |
| REGION | 3 | 6.00472236 | 2.00157412 | 2.30 | 0.0815 |
________________________________________________________________
| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| Model | 7 | 109.19919376 | 15.59988482 | 17.77 | 0.0001 |
| Error | 105 | 92.18062925 | 0.87791075 | | |
| Total | 112 | 201.37982301 | | | |

| R-Square | C.V. | Root MSE | RISK Mean |
| 0.542255 | 21.51544 | 0.9369689 | 4.3548673 |

| Source | DF | Type I SS | Mean Square | F Value | Pr > F |
| CULTURE | 1 | 62.96314170 | 62.96314170 | 71.72 | 0.0001 |
| STAY | 1 | 27.73884588 | 27.73884588 | 31.60 | 0.0001 |
| NURSES | 1 | 7.01369438 | 7.01369438 | 7.99 | 0.0056 |
| SCHOOL | 1 | 2.16544259 | 2.16544259 | 2.47 | 0.1193 |
| REGION | 3 | 9.31806922 | 3.10602307 | 3.54 | 0.0173 |

| Source | DF | Type III SS | Mean Square | F Value | Pr > F |
| CULTURE | 1 | 32.63679640 | 32.63679640 | 37.18 | 0.0001 |
| STAY | 1 | 24.70628794 | 24.70628794 | 28.14 | 0.0001 |
| NURSES | 1 | 8.99075614 | 8.99075614 | 10.24 | 0.0018 |
| SCHOOL | 1 | 3.19583271 | 3.19583271 | 3.64 | 0.0591 |
| REGION | 3 | 9.31806922 | 3.10602307 | 3.54 | 0.0173 |
________________________________________________________________
| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| Model | 6 | 106.00336105 | 17.66722684 | 19.64 | 0.0001 |
| Error | 106 | 95.37646196 | 0.89977794 | | |
| Corrected Total | 112 | 201.37982301 | | | |

| R-Square | C.V. | Root MSE | RISK Mean |
| 0.526385 | 21.78175 | 0.9485663 | 4.3548673 |

| Source | DF | Type I SS | Mean Square | F Value | Pr > F |
| CULTURE | 1 | 62.96314170 | 62.96314170 | 69.98 | 0.0001 |
| STAY | 1 | 27.73884588 | 27.73884588 | 30.83 | 0.0001 |
| NURSES | 1 | 7.01369438 | 7.01369438 | 7.79 | 0.0062 |
| REGION | 3 | 8.28767910 | 2.76255970 | 3.07 | 0.0310 |

| Source | DF | Type III SS | Mean Square | F Value | Pr > F |
| CULTURE | 1 | 30.50324858 | 30.50324858 | 33.90 | 0.0001 |
| STAY | 1 | 22.98974524 | 22.98974524 | 25.55 | 0.0001 |
| NURSES | 1 | 5.85040582 | 5.85040582 | 6.50 | 0.0122 |
| REGION | 3 | 8.28767910 | 2.76255970 | 3.07 | 0.0310 |
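The three fits are nested, so the Type III F statistics for NRATIO (first fit) and SCHOOL (second fit) can be recovered as extra sum of squares tests from the error lines alone. The sketch below does this arithmetic with the error sums of squares and degrees of freedom copied from the output above; the function name `extra_ss_f` is illustrative.

```python
def extra_ss_f(sse_reduced, df_reduced, sse_full, df_full):
    """Extra sum of squares F statistic for a reduced model inside a full model."""
    extra_ms = (sse_reduced - sse_full) / (df_reduced - df_full)
    return extra_ms / (sse_full / df_full)

# (error SS, error DF) from the three ANOVA tables in the output.
sse_fit1, df_fit1 = 90.43580045, 104   # Culture Stay Nurses Nratio School Region
sse_fit2, df_fit2 = 92.18062925, 105   # Culture Stay Nurses School Region
sse_fit3, df_fit3 = 95.37646196, 106   # Culture Stay Nurses Region

f_nratio = extra_ss_f(sse_fit2, df_fit2, sse_fit1, df_fit1)  # drop NRATIO from fit 1
f_school = extra_ss_f(sse_fit3, df_fit3, sse_fit2, df_fit2)  # drop SCHOOL from fit 2
print(round(f_nratio, 2), round(f_school, 2))  # prints: 2.01 3.64
```

These agree with the Type III F values 2.01 for NRATIO in the first fit and 3.64 for SCHOOL in the second fit.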
CONCLUSIONS