## 3.14 Data
season = c(rep("Summer",10), rep("Shoulder",7), rep("Winter",8))
score = c(83,85,85,87,90,88,88,84,91,90,91,87,84,87,85,86,83,94,91,87,85,87,91,92,86)
obs = 1:length(score)
data314 = data.frame(obs,season,score)

### aov stands for Analysis of Variance.
### It's the fastest way to get values like the
### SSE (sum of squares error) and
### SST (sum of squares total)
aov(score ~ season, data=data314)
### 35.60 for SS_Season
### 184.63 for SS_Error
### --------------------
### 220.24 for SS_Total

## Variance is the mean squared error when no factors are considered
## MST = SST / (n-1)
## SST = MST * (n-1)
n = length(data314$score)
var(data314$score)*(n - 1) ## 220.24

### A more user-friendly command is anova(), but it requires lm() to be used first.
### lm() is the linear model function; we can use it to compare these groups.
lm(score ~ season, data=data314)

### That's not very informative, so we can use summary() to get more out of it.
mod = lm(score ~ season, data=data314)
summary(mod)

## The (Intercept) is the mean response of the first level of the factor alphabetically;
## in this case, that's "Shoulder".
## If there were multiple factors, (Intercept) would use the first level of each one.
## seasonSummer is the additional mean score from playing in summer rather than the shoulder season;
## seasonWinter is the same, but for winter rather than shoulder.

## To verify:
attach(data314) ## makes data314 the default dataset
mean( score[which(season == "Shoulder")])
mean( score[which(season == "Summer")]) ## 0.9571 more than Shoulder
mean( score[which(season == "Winter")]) ## 2.9821 more than Shoulder

## Now that we have the linear model, we can pass it to anova()
mod = lm(score ~ season, data=data314)
anova(mod)
## This gives the F-statistic and the p-value against the null hypothesis that
## all the groups have equal means.
## summary() also gives you some of this information.

## Finally, you can get other information from the linear model.
## We may want to see the residuals.
mod$residuals ## see the residuals
plot(mod$residuals) ## plotting them
shapiro.test(mod$residuals) ## checking for normality
qqnorm(mod$residuals) ## normal Q-Q plot, another normality check

## Or use the fitted values, coefficients, or see what the model was without
## reading the whole summary with $call
mod$fitted.values
mod$coefficients
mod$df.residual
mod$call
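## As a cross-check (a sketch, not part of the original exercise): the sums of
## squares reported by aov() can be reproduced from their definitions with
## base R's ave(), which replaces each observation with its group mean.
group.fit = ave(data314$score, data314$season)     ## per-observation group means
sst = sum((data314$score - mean(data314$score))^2) ## total SS, 220.24
sse = sum((data314$score - group.fit)^2)           ## error SS, 184.63
sst - sse                                          ## season SS, 35.60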
########## Example with 3.21
## 3.21 Data
orifice = rep( c(0.37,0.51,0.71,1.02,1.40,1.99), each=4)
radon = c(80,83,83,85,75,75,79,79,74,73,76,77,67,72,74,74,62,62,67,69,60,61,64,66)
obs = 1:length(radon)
data321 = data.frame(obs,radon,orifice)

## There are six levels to our factor now: different sizes of orifice diameter.
## If we take an aov()...
aov(radon ~ orifice, data=data321)
## ...orifice is only taking 1 degree of freedom. Why?

## A linear model tells us more
mod1 = lm(radon ~ orifice, data=data321)
mod1
## intercept = 84.22
mean( radon[which(data321$orifice == 0.37)]) ## 82.75
## The intercept isn't giving the mean response of the 1st level of orifice;
## it's giving the expected response for an orifice of diameter 0.
## The coefficient for orifice is the slope: the reduction in the response
## for every unit increase in orifice. Not what we wanted at all.
## Note for now that the (unadjusted) R-squared is 0.8258
summary(mod1)

## Orifice needs to be treated as a factor
data321.f = data321 ## .f for factor
data321.f$orifice = as.factor(data321$orifice)

## Now to try the aov again:
## it should need 6 - 1 = 5 df for the 6 levels (and 24 - 6 = 18 residual df)
aov(radon ~ orifice, data=data321.f)

## And the linear model shows five dummy coefficients plus the intercept.
## The intercept is the mean response for orifice size 0.37, 82.750;
## the mean response for orifice size 0.51 is 5.750 less than that.
mod2 = lm(radon ~ orifice, data=data321.f)
## The R-squared has been improved to 0.8955
summary(mod2)

## Verify
mean( radon[which(data321.f$orifice == 0.37)])
mean( radon[which(data321.f$orifice == 0.51)])

## ANOVA assumes that the variances within each group can be pooled.
## We can check this with the Bartlett test.
## We fail to reject that null, so we're fine.
bartlett.test(data321.f$radon, data321.f$orifice)

## Finally, the anova()
anova(mod2)

####### CONTRASTS (page 92)
## t-tests determine if two group means are significantly different.
## ANOVA determines if any of 3 or more group means are significantly different.
## Contrasts are used to test more complex differences.

## Say, with the radon example, the only difference we cared about was whether
## the orifice diameter was greater than 1.00 or not.
## The first three groups have diameters less than 1, the last three more than 1.
## So we could be interested in the contrast
## (u1 + u2 + u3) - (u4 + u5 + u6)

## It's easy enough to find.
u1 = mean( radon[which(data321.f$orifice == 0.37)])
u2 = mean( radon[which(data321.f$orifice == 0.51)])
u3 = mean( radon[which(data321.f$orifice == 0.71)])
u4 = mean( radon[which(data321.f$orifice == 1.02)])
u5 = mean( radon[which(data321.f$orifice == 1.40)])
u6 = mean( radon[which(data321.f$orifice == 1.99)])

contrast.1 = (u1 + u2 + u3) - (u4 + u5 + u6)
c1 = c(1,1,1,-1,-1,-1) # +1 for each positive u1,u2,... -1 for each negative

### Which is great, but an estimate is nothing without its variance.
### sigma^2 is the error variance;
### from anova(mod2) we know an unbiased estimate of it is the MSE, 7.347
mse = 7.347
n = 4 ## replicates per group: eq. 3.26 uses the per-group n, not the 24 total observations
## From 3.26 on p.92
var.contr1 = mse/n * sum(c1^2)

### If the first three means equal the last three,
### contrast.1 should be distributed about zero with a t-distribution.
### Why t? Because sigma^2 is estimated by the MSE from a finite sample:
### 24 observations in 6 groups, so the statistic has 24 - 6 = 18 degrees of freedom.
## From 3.27 on p.93
## t = value / sqrt(var)
t.contr1 = contrast.1 / sqrt(var.contr1)
t.contr1
1 - pt(t.contr1,df=18) ## one-sided p-value; double it for a two-sided test

######### Contrast 2, all vs 1
## Another common case is where everything is compared to the mean of a single group.
## In medicine this happens when multiple treatments are being assigned
## and their effects are being compared to a control.

## To see if the last five groups are different from the first, we can't use...
u1 - (u2 + u3 + u4 + u5 + u6)
### ...because the coefficients (1,-1,-1,-1,-1,-1) don't sum to zero:
### even if all six means were equal, this quantity wouldn't be centered at zero,
### and it's the differences in the means that matter. If there's no
### difference, we want to know what the contrast will be (namely zero).
## To compensate, the side with fewer groups is multiplied up.
contrast.2 = 5*u1 - (u2 + u3 + u4 + u5 + u6)
c2 = c(5,-1,-1,-1,-1,-1)

## This isn't perfect. With so much riding on one group, the variance
## of that group matters more.
var.contr2 = mse/n * sum(c2^2)

## The t-score also depends a lot on that one group
t.contr2 = contrast.2 / sqrt(var.contr2)
t.contr2
1 - pt(t.contr2,df=18)
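## As a cross-check (a sketch, not in the original exercise): a contrast can
## equivalently be tested with an F-statistic. Its sum of squares is
## SS_C = C^2 / (sum(c^2)/n), and F0 = SS_C / MSE equals the square of the
## t-statistic computed above.
ss.contr2 = contrast.2^2 / (sum(c2^2)/n)
f.contr2 = ss.contr2 / mse ## F with 1 and 18 df
f.contr2
t.contr2^2 ## identical to f.contr2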
### Note 1: I recommend you read the part on "Unequal Sample Sizes" on page 94,
### just at the end of 3.5.4 - Contrasts. The t-statistic and the contrast sum of
### squares have an interesting implication: if a group is more important for a
### comparison, it's efficient to put more of your sample into that group.
### In our contrast 2 case, it would have been optimal to make the sample
### of the 0.37 group sqrt(5) times as big as the others.

## Note 2: I've yet to find a clean way to change the order of factor levels other
## than to change the names of the levels (e.g. A_Low, B_Medium, C_High); if anyone
## finds or knows a better way, you will have my thanks.
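## For what it's worth, a sketch of one base-R option that seems to do this
## (the quality vector below is just a made-up illustration):
## factor() accepts an explicit level order, and relevel() moves a chosen
## level to the front, which lm() then uses as the reference level.
quality = c("Low","High","Medium","Low")
quality.f = factor(quality, levels = c("Low","Medium","High")) ## explicit order
levels(quality.f) ## "Low" "Medium" "High"
relevel(quality.f, ref = "High") ## "High" becomes the first (reference) level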