STAT 330 Lecture 34
Reading for Today's Lecture: Chapter 13.
Goals of Today's Lecture:
Today's notes
The General Linear Model
$Y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i$ for $i = 1, \ldots, n$.
Special Cases and Examples:
One Way Layout:
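$Y_{ij} = \mu_i + \epsilon_{ij}$ with parameters $\mu_1, \ldots, \mu_I$ and $p = I$, or
$Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$ with parameters $\mu, \alpha_1, \ldots, \alpha_{I-1}$.
Note: $\alpha_I$ is redundant because the $\alpha_i$ are restricted to sum to zero, so that $\alpha_I = -(\alpha_1 + \cdots + \alpha_{I-1})$.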
Special notes:
Two Way Layout without replicates:
$Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$ with the restrictions $\sum_i \alpha_i = 0$ and $\sum_j \beta_j = 0$.
Multiple Regression
In multiple regression we have an equation like the general linear model above, but with the $x_{ij}$ filled in with the values of more than one independent variable.
Example: We now regress hardness on SAND and FIBRE content. Previously we had treated each of these variables as merely having 3 (unordered) categories. Now we use the numerical values of those categories as the covariates $u_i$ (sand content) and $v_i$ (fibre content).
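All the models above can be written in the matrix form $Y = X\beta + \epsilon$, where $Y$ is the $n$-vector of responses, $X$ is the $n \times p$ design matrix with entries $x_{ij}$, $\beta$ is the $p$-vector of parameters and $\epsilon$ is the $n$-vector of errors.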
In the two way layout example we have, for instance:
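As an illustration of how such a design matrix is assembled, here is a minimal Python sketch for a two way layout without replicates. The sizes I = 2 and J = 3, the sum-to-zero coding, and the helper names `effect_code` and `two_way_design` are assumptions made for this example; they are not taken from the lecture.

```python
# Sketch: effect-coded design matrix for a two way layout without replicates.
# I, J and the sum-to-zero coding are illustrative assumptions.

def effect_code(level, n_levels):
    """Levels 1..n_levels-1 get an indicator vector; the last level is all -1."""
    if level < n_levels:
        return [1 if k == level else 0 for k in range(1, n_levels)]
    return [-1] * (n_levels - 1)

def two_way_design(I, J):
    """Rows are cells (i, j); columns are mu, alpha_1..alpha_{I-1}, beta_1..beta_{J-1}."""
    rows = []
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            rows.append([1] + effect_code(i, I) + effect_code(j, J))
    return rows

X = two_way_design(2, 3)
for row in X:
    print(row)
```

With this coding the matrix has 1 + (I-1) + (J-1) columns, one per free parameter, and the redundant last-level effects appear only through the rows of -1 entries.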
Analysis Principles
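Estimate $\beta$ by least squares. With $X$ the design matrix and $Y$ the vector of responses, minimizing $(Y - X\beta)^T (Y - X\beta)$ leads to the normal equations $X^T X \hat\beta = X^T Y$, giving the matrix algebra solution $\hat\beta = (X^T X)^{-1} X^T Y$.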
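The matrix algebra solution can be sketched numerically. The following pure Python example solves the normal equations by Gaussian elimination for a small made-up data set; the helper names and the data are hypothetical, and a real analysis would use a statistical package such as SAS instead.

```python
# Sketch: least squares via the normal equations X'X b = X'Y (pure Python).

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[r]) + [b[r]] for r in range(n)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def least_squares(X, y):
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)

# Made-up data lying exactly on the line y = 1 + 2x
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 7]
beta_hat = least_squares(X, y)
print(beta_hat)
```

Because the made-up responses lie exactly on a line, the estimates recover the intercept 1 and slope 2 up to rounding error.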
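| Source | SS | df |
| Regression | $\sum_i \hat{Y}_i^2$ | p |
| Error | $\sum_i (Y_i - \hat{Y}_i)^2$ | n-p |
| Total (not corrected) | $\sum_i Y_i^2$ | n |

| Source | SS | df |
| Regression | $\sum_i (\hat{Y}_i - \bar{Y})^2$ | p-1 |
| Error | $\sum_i (Y_i - \hat{Y}_i)^2$ | n-p |
| Total (corrected) | $\sum_i (Y_i - \bar{Y})^2$ | n-1 |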
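Both decompositions in the tables above can be checked numerically. This is a minimal Python sketch for a straight-line fit (so p = 2) on made-up data; the data values and variable names are assumptions for illustration only.

```python
# Sketch: uncorrected and corrected ANOVA decompositions for a straight-line fit.
x = [1, 2, 3, 4, 5]   # made-up covariate values
y = [2, 4, 5, 4, 5]   # made-up responses
n, p = len(y), 2

xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
slope = sxy / sxx
intercept = ybar - slope * xbar
fitted = [intercept + slope * a for a in x]

ss_error = sum((b - f) ** 2 for b, f in zip(y, fitted))          # df n-p
ss_reg_uncorrected = sum(f ** 2 for f in fitted)                 # df p
ss_total_uncorrected = sum(b ** 2 for b in y)                    # df n
ss_reg_corrected = sum((f - ybar) ** 2 for f in fitted)          # df p-1
ss_total_corrected = sum((b - ybar) ** 2 for b in y)             # df n-1

print(ss_reg_uncorrected + ss_error, ss_total_uncorrected)
print(ss_reg_corrected + ss_error, ss_total_corrected)
```

In each table the regression and error sums of squares add up to the total, and so do the degrees of freedom: p + (n-p) = n and (p-1) + (n-p) = n-1.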
SAS example: Multiple Regression
The data consist of casting hardnesses for 18 samples prepared under 3 levels of sand added and 3 levels of carbon fibre added. See Q 15 in Chapter 11. I use proc glm to regress hardness on sand content and fibre content but now treat them as continuous variables.
I ran the following SAS code:
    options pagesize=60 linesize=80;
    data plaster;
      infile 'plaster.dat';
      input sand fibre hardness strength;
    /* no class statement, so sand and fibre are treated as continuous */
    proc glm data=plaster;
      model hardness = sand fibre;
      output out=plasfit p=yhat r=resid;   /* save fitted values and residuals */
    /* residual diagnostics */
    proc univariate data=plasfit plot normal;
      var resid;
    proc plot;
      plot resid*sand;
      plot resid*fibre;
      plot resid*yhat;
    run;
The line labelled model says that I am interested in the effects of sand and fibre; the lack of a class statement makes glm do multiple regression.
The abridged output from proc glm is:
General Linear Models Procedure
Number of observations in data set = 18
Dependent Variable: HARDNESS
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 167.41666667 83.70833333 11.53 0.0009
Error 15 108.86111111 7.25740741
Corrected Total 17 276.27777778
R-Square C.V. Root MSE HARDNESS Mean
0.605972 3.870011 2.6939576 69.611111
Source DF Type I SS Mean Square F Value Pr > F
SAND 1 102.08333333 102.08333333 14.07 0.0019
FIBRE 1 65.33333333 65.33333333 9.00 0.0090
T for H0: Pr > |T| Std Error of
Parameter Estimate Parameter=0 Estimate
INTERCEPT 64.36111111 50.68 0.0001 1.26994378
SAND 0.19444444 3.75 0.0019 0.05184524
FIBRE 0.09333333 3.00 0.0090 0.03110714
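The summary statistics in this output follow from the sums of squares by simple arithmetic. The Python sketch below re-derives them from the printed values; the variable names are my own, and the numbers are copied from the listing above.

```python
# Sketch: re-derive the proc glm summary statistics from the printed sums of squares.
import math

ss_model, df_model = 167.41666667, 2
ss_error, df_error = 108.86111111, 15
ss_total = 276.27777778            # corrected total, 17 df
ss_sand, ss_fibre = 102.08333333, 65.33333333   # Type I SS
mean_hardness = 69.611111

ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_value = ms_model / ms_error      # overall F statistic
r_square = ss_model / ss_total
root_mse = math.sqrt(ms_error)
cv = 100 * root_mse / mean_hardness

print(round(f_value, 2), round(r_square, 6), round(root_mse, 4), round(cv, 4))
```

The two Type I sums of squares add up to the model sum of squares, and the model and error sums of squares add up to the corrected total.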
The conclusions are that both sand and fibre have an effect on hardness: in the so-called Type I SS table the P-values are 0.0019 and 0.0090, so I reject the two null hypotheses. The last table permits confidence intervals for the slopes. You can, for instance, predict that a SAND content of 10% and a FIBRE content of 20% would produce a hardness of $64.361 + 0.1944 \times 10 + 0.0933 \times 20 \approx 68.17$.
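The prediction and the confidence intervals for the slopes can be worked out from the parameter table. This is a minimal Python sketch; the function names are hypothetical, and the critical value 2.131 (upper 2.5% point of the t distribution on 15 degrees of freedom) is taken from a t table rather than computed.

```python
# Sketch: prediction and 95% confidence intervals from the proc glm parameter table.
intercept, b_sand, b_fibre = 64.36111111, 0.19444444, 0.09333333
se_sand, se_fibre = 0.05184524, 0.03110714
t_crit = 2.131   # assumed table value: t on 15 df, upper 2.5% point

def predict(sand, fibre):
    """Fitted hardness at the given sand and fibre contents (in %)."""
    return intercept + b_sand * sand + b_fibre * fibre

def conf_int(estimate, std_error):
    """Interval estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return (estimate - half_width, estimate + half_width)

hardness_hat = predict(10, 20)
ci_sand = conf_int(b_sand, se_sand)
ci_fibre = conf_int(b_fibre, se_fibre)
print(round(hardness_hat, 2), ci_sand, ci_fibre)
```

Neither interval contains 0, which agrees with the small P-values for SAND and FIBRE.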
The model fit should be checked by examining various diagnostic statistics and plots:
Univariate Procedure
Variable=RESID
Moments
N 18 Sum Wgts 18
Mean 0 Sum 0
Std Dev 2.530533 Variance 6.403595
Skewness -0.1431 Kurtosis -0.29863
USS 108.8611 CSS 108.8611
CV . Std Mean 0.596452
T:Mean=0 0 Pr>|T| 1.0000
Num ^= 0 18 Num > 0 7
M(Sign) -2 Pr>=|M| 0.4807
Sgn Rank 0.5 Pr>=|S| 0.9915
W:Normal 0.976631 Pr<W 0.8888
Quantiles(Def=5)
100% Max 4.388889 99% 4.388889
75% Q3 2.055556 95% 4.388889
50% Med -0.40278 90% 3.805556
25% Q1 -1.36111 10% -3.36111
0% Min -5.19444 5% -5.19444
1% -5.19444
Range 9.583333
Q3-Q1 3.416667
Mode -0.86111
Extremes
Lowest Obs Highest Obs
-5.19444( 5) 2.055556( 16)
-3.36111( 1) 2.305556( 7)
-2.94444( 15) 2.305556( 8)
-2.02778( 13) 3.805556( 6)
-1.36111( 2) 4.388889( 10)
Stem Leaf # Boxplot
4 4 1 |
2 1338 4 +-----+
0 57 2 | + |
-0 4996530 7 *-----*
-2 490 3 |
-4 2 1 |
----+----+----+----+
Normal Probability Plot
[Line-printer normal probability plot: residuals on the vertical axis (scale -5 to 5) against normal quantiles on the horizontal axis (scale -2 to +2).]
Plot of RESID*SAND. Legend: A = 1 obs, B = 2 obs, etc.
[Line-printer scatterplot: RESID on the vertical axis (scale -6 to 6) against SAND on the horizontal axis (scale 0 to 30).]
Plot of RESID*FIBRE. Legend: A = 1 obs, B = 2 obs, etc.
[Line-printer scatterplot: RESID on the vertical axis (scale -6 to 6) against FIBRE on the horizontal axis (scale 0 to 50).]
Plot of RESID*YHAT. Legend: A = 1 obs, B = 2 obs, etc.
[Line-printer scatterplot: RESID on the vertical axis (scale -6 to 6) against YHAT on the horizontal axis (scale 64 to 76).]
The diagnostic plots seem fine to me.
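One detail of the proc univariate output is worth a quick check: its Variance divides the residual sum of squares by n-1 = 17, whereas the Mean Square Error from proc glm divides by n-p = 15. A small Python sketch, using the CSS value printed above and my own variable names:

```python
# Sketch: residual variance from proc univariate versus MSE from proc glm.
import math

css = 108.8611    # CSS of the residuals, equal to the error sum of squares
n, p = 18, 3      # 18 observations; intercept plus two slopes

var_univariate = css / (n - 1)   # what proc univariate reports as Variance
mse = css / (n - p)              # what proc glm reports as error Mean Square

print(round(var_univariate, 4), round(math.sqrt(var_univariate), 4))
print(round(mse, 4), round(math.sqrt(mse), 4))
```

So the Std Dev of 2.5305 in the univariate output is somewhat smaller than the Root MSE of 2.6940; the latter is the usual estimate of the error standard deviation.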
In the two way ANOVA model fit for these data we allowed the possibility that the effect of SAND depended on the level of FIBRE. We can do the same here and include an interaction term in the model. The model equation fitted by the previous run of SAS is
$$Y_i = \beta_1 + \beta_2 u_i + \beta_3 v_i + \epsilon_i$$
for $i = 1, \ldots, 18$. Here $Y$ is hardness, $u$ is sand content (in %) and $v$ is fibre content (in %). To include an interaction term we modify the model equation to
$$Y_i = \beta_1 + \beta_2 u_i + \beta_3 v_i + \beta_4 u_i v_i + \epsilon_i .$$
The coefficient $\beta_4$ is then the interaction.
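In design matrix terms, the interaction model simply adds a column holding the product of the sand and fibre values. The Python sketch below shows this; the factor levels (0, 15, 30 for sand and 0, 25, 50 for fibre) are an assumption read off the axis scales of the residual plots, and the function names are hypothetical.

```python
# Sketch: design matrix rows (1, u, v, u*v) for the interaction model.
# The levels below are ASSUMED from the plot axes, not read from plaster.dat.
sand_levels = [0, 15, 30]
fibre_levels = [0, 25, 50]

def design_row(u, v):
    """Intercept, sand, fibre and the sand-by-fibre product."""
    return [1, u, v, u * v]

X = [design_row(u, v) for u in sand_levels for v in fibre_levels]

def sand_slope(beta, v):
    """Change in mean hardness per unit of sand when fibre is held at v.

    beta is [intercept, sand, fibre, interaction] (zero-based positions).
    """
    return beta[1] + beta[3] * v

for row in X:
    print(row)
```

With a nonzero interaction coefficient the slope for sand changes with the fibre level, which is what it means for the effect of SAND to depend on FIBRE.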
I ran the following SAS code:

    options pagesize=60 linesize=80;
    data plaster;
      infile 'plaster.dat';
      input sand fibre hardness strength;
    proc glm data=plaster;
      model hardness = sand|fibre;   /* sand, fibre and sand*fibre */
    run;

which produces
General Linear Models Procedure
Dependent Variable: HARDNESS
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 3 168.54166667 56.18055556 7.30 0.0035
Error 14 107.73611111 7.69543651
Corrected Total 17 276.27777778
R-Square C.V. Root MSE HARDNESS Mean
0.610044 3.985089 2.7740650 69.611111
Source DF Type I SS Mean Square F Value Pr > F
SAND 1 102.08333333 102.08333333 13.27 0.0027
FIBRE 1 65.33333333 65.33333333 8.49 0.0113
SAND*FIBRE 1 1.12500000 1.12500000 0.15 0.7079
T for H0: Pr > |T| Std Error of
Parameter Estimate Parameter=0 Estimate
INTERCEPT 63.98611111 39.14 0.0001 1.63463347
SAND 0.21944444 2.60 0.0210 0.08441211
FIBRE 0.10833333 2.14 0.0505 0.05064727
SAND*FIBRE -0.00100000 -0.38 0.7079 0.00261541
There is no sign of a need for an interaction term (P = 0.7079 for SAND*FIBRE), so the original model seems to be reasonable. Notice that the resulting model, with only 3 parameters, is more parsimonious than the model for the two way layout, which has 5 parameters (or 9 with an interaction term). The model asserts that hardness actually increases linearly with sand content and also with fibre content.
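The comparison of the two fits can be checked by arithmetic. This Python sketch computes the extra-sum-of-squares F statistic for the interaction from the two error lines printed by SAS, and counts parameters for a two way layout; the variable and function names are my own.

```python
# Sketch: extra-sum-of-squares F test for the interaction, from the two SAS runs.
sse_additive, df_additive = 108.86111111, 15        # model: sand fibre
sse_interaction, df_interaction = 107.73611111, 14  # model: sand|fibre

extra_ss = sse_additive - sse_interaction
extra_df = df_additive - df_interaction
f_interaction = (extra_ss / extra_df) / (sse_interaction / df_interaction)

def two_way_parameter_counts(I, J):
    """Free parameters in a two way layout: additive model, then with interaction."""
    additive = 1 + (I - 1) + (J - 1)
    with_interaction = I * J
    return additive, with_interaction

print(round(extra_ss, 3), round(f_interaction, 2))
print(two_way_parameter_counts(3, 3))
```

The drop in error sum of squares equals the Type I SS for SAND*FIBRE, and the F statistic matches the value in the SAND*FIBRE row of the output.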