Simple Linear Regression

The REG Procedure

Suppose that a response variable $Y$ can be predicted by a linear function of a regressor variable $X$. You can estimate $\beta_0$, the intercept, and $\beta_1$, the slope, in

$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

for the observations $i = 1, 2, \ldots, n$. Fitting this model with the REG procedure requires only the following MODEL statement, where y is the response variable and x is the regressor variable.

   proc reg;
      model y=x;
   run;

For example, you might use regression analysis to find out how well you can predict a child's weight if you know that child's height. The following data are from a study of nineteen children. Height and weight are measured for each child.

   title 'Simple Linear Regression';
   data Class;
      input Name $ Height Weight Age @@;
      datalines;
   Alfred  69.0 112.5 14  Alice  56.5  84.0 13  Barbara 65.3  98.0 13
   Carol   62.8 102.5 14  Henry  63.5 102.5 14  James   57.3  83.0 12
   Jane    59.8  84.5 12  Janet  62.5 112.5 15  Jeffrey 62.5  84.0 13
   John    59.0  99.5 12  Joyce  51.3  50.5 11  Judy    64.3  90.0 14
   Louise  56.3  77.0 12  Mary   66.5 112.0 15  Philip  72.0 150.0 16
   Robert  64.8 128.0 12  Ronald 67.0 133.0 15  Thomas  57.5  85.0 11
   William 66.5 112.0 15
   ;

The equation of interest is

${Weight} = \beta_0 + \beta_1 {Height} + \epsilon$

The variable Weight is the response or dependent variable in this equation, and $\beta_0$ and $\beta_1$ are the unknown parameters to be estimated. The variable Height is the regressor or independent variable, and $\epsilon$ is the unknown error. The following statements invoke the REG procedure and fit this model to the data.

   proc reg;
      model Weight = Height;
   run;
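Because the PROC REG statement above has no DATA= option, the procedure analyzes the most recently created data set, which here is Class. A minimal equivalent sketch that names the data set explicitly, which can be safer when a session has created other data sets in the meantime:

```sas
/* Same fit as above, with the input data set named explicitly */
proc reg data=Class;
   model Weight = Height;
run;
```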

Figure 55.1 includes some information concerning model fit.

Simple Linear Regression

The REG Procedure

Model: MODEL1

Dependent Variable: Weight

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	7193.24912	7193.24912	57.08	<.0001
Error	17	2142.48772	126.02869
Corrected Total	18	9335.73684

Root MSE	11.22625	R-Square	0.7705
Dependent Mean	100.02632	Adj R-Sq	0.7570
Coeff Var	11.22330

Figure 55.1: ANOVA Table

The F statistic for the overall model is highly significant (F=57.076, p<0.0001), indicating that the model explains a significant portion of the variation in the data.

The degrees of freedom can be used to check the accuracy of the data and the model. The model degrees of freedom are one less than the number of parameters to be estimated. This model estimates two parameters, $\beta_0$ and $\beta_1$; thus, the degrees of freedom should be 2-1=1. The corrected total degrees of freedom are always one less than the total number of observations in the data set, in this case 19-1=18.
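As a quick check, the remaining entries of the ANOVA table in Figure 55.1 can be reproduced by hand. The error degrees of freedom are the corrected total minus the model degrees of freedom, each mean square is a sum of squares divided by its degrees of freedom, and the F statistic is the ratio of the two mean squares. The subscripted symbols are shorthand introduced here for illustration:

$DF_{Error} = 18 - 1 = 17$

$MS_{Model} = 7193.24912 / 1 = 7193.24912$

$MS_{Error} = 2142.48772 / 17 \approx 126.02869$

$F = MS_{Model} / MS_{Error} = 7193.24912 / 126.02869 \approx 57.076$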

Several simple statistics follow the ANOVA table. The Root MSE is an estimate of the standard deviation of the error term. The coefficient of variation, or Coeff Var, is a unitless expression of the variation in the data. The R-Square and Adj R-Square are two statistics used in assessing the fit of the model; values close to 1 indicate a better fit. The R-Square of 0.77 indicates that Height accounts for 77% of the variation in Weight.
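These statistics follow directly from the values printed in Figure 55.1. The worked sketch below uses the standard definitions, with $n = 19$ observations and $p = 2$ estimated parameters; the symbols $n$ and $p$ are introduced here for illustration:

$\text{Root MSE} = \sqrt{126.02869} \approx 11.22625$

$\text{Coeff Var} = 100 \times 11.22625 / 100.02632 \approx 11.2233$

$R^2 = 7193.24912 / 9335.73684 \approx 0.7705$

$\text{Adj } R^2 = 1 - (1 - R^2) \frac{n-1}{n-p} = 1 - 0.2295 \times \frac{18}{17} \approx 0.7570$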

The "Parameter Estimates" table shown in Figure 55.2 contains the estimates of $\beta_0$ and $\beta_1$. The table also contains the t statistics and the corresponding p-values for testing whether each parameter is significantly different from zero. The p-values (t=-4.432, p=0.0004 and t=7.555, p<0.0001) indicate that the intercept and Height parameter estimates, respectively, are highly significant.

Simple Linear Regression

The REG Procedure

Model: MODEL1

Dependent Variable: Weight

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|
Intercept	1	-143.02692	32.27459	-4.43	0.0004
Height	1	3.89903	0.51609	7.55	<.0001

Figure 55.2: Parameter Estimates
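Each t value in Figure 55.2 is the parameter estimate divided by its standard error, and it is compared with a t distribution on the 17 error degrees of freedom. As a worked check (the subscripts are labels introduced here):

$t_{Intercept} = -143.02692 / 32.27459 \approx -4.432$

$t_{Height} = 3.89903 / 0.51609 \approx 7.555$

With a single regressor, the square of the slope's t value equals the overall F statistic: $7.555^2 \approx 57.08$.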

From the parameter estimates, the fitted model is

${Weight} = -143.0 + 3.9 \times {Height}$
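To illustrate how the fitted equation is used, consider a hypothetical child with a Height of 60, a value chosen here for illustration that lies within the observed range of 51.3 to 72.0. Using the unrounded estimates from Figure 55.2:

$\widehat{Weight} = -143.02692 + 3.89903 \times 60 \approx 90.9$

The slope indicates that each additional unit of Height is associated with an increase of about 3.9 units in predicted Weight. The negative intercept corresponds to Height = 0, which is far outside the observed data, so it is best not interpreted on its own.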

The REG procedure can be used interactively. After you specify a model with the MODEL statement and submit the PROC REG statements, you can submit further statements without reinvoking the procedure. The following statements can now be submitted to request a plot of the residuals versus the predicted values, as shown in Figure 55.3.

   plot r.*p.;
   run;

Figure 55.3: Plot of Residual vs. Predicted Values
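The same diagnostic can also be produced noninteractively by saving the predicted values and residuals to a data set and plotting them. The following is a minimal sketch, not the method used for Figure 55.3; the data set name ClassDiag and the variable names Predicted and Residual are hypothetical names chosen for this illustration.

```sas
/* Refit the model and save predicted values and residuals */
proc reg data=Class;
   model Weight = Height;
   /* P= and R= name the output variables that hold the   */
   /* predicted values and the residuals (names assumed)  */
   output out=ClassDiag p=Predicted r=Residual;
run;
quit;

/* Scatter plot of residuals against predicted values, */
/* with a horizontal reference line at zero            */
proc sgplot data=ClassDiag;
   scatter x=Predicted y=Residual;
   refline 0 / axis=y;
run;
```

Saving the values with the OUTPUT statement keeps them available for further checks, such as listing the observations with the largest residuals.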

A trend in the residuals would indicate nonconstant variance in the data. Figure 55.3 may indicate a slight trend in the residuals; they appear to increase slightly as the predicted values increase. A fan-shaped trend may indicate the need for a variance-stabilizing transformation. A curved trend (such as a semicircle) may indicate the need for a quadratic term in the model. Because the residuals in Figure 55.3 show no pronounced trend, the analysis is considered to be acceptable.
