Generalized Linear Models Theory

The GENMOD Procedure

Generalized Linear Models Theory

This is a brief introduction to the theory of generalized linear models . See the "References" section for sources of more detailed information.

Response Probability Distributions

In generalized linear models, the response is assumed to possess a probability distribution of the exponential form. That is, the probability density of the response Y for continuous response variables, or the probability function for discrete responses, can be expressed as

$f(y) = \exp \{ \frac{y\theta - b(\theta)} {a(\phi)} + c(y,\phi) \}$

for some functions a, b, and c that determine the specific distribution. For fixed $\phi$ , this is a one parameter exponential family of distributions. The functions a and c are such that $a(\phi) = \phi / w$ and $c = c(y, \phi / w)$ ,where w is a known weight for each observation. A variable representing w in the input data set may be specified in the WEIGHT statement. If no WEIGHT statement is specified, w_i=1 for all observations.

Standard theory for this type of distribution gives expressions for the mean and variance of Y.

$E(Y) & = & b^'(\theta) \ {Var}(Y) & = & \frac{b^' '(\theta) \phi}w \$

where the primes denote derivatives with respect to $\theta$ .If $\mu$ represents the mean of Y, then the variance expressed as a function of the mean is

${Var}(Y) = \frac{V(\mu) \phi}w \$

where V is the variance function.

Probability distributions of the response Y in generalized linear models are usually parameterized in terms of the mean $\mu$ and dispersion parameter $\phi$ instead of the natural parameter $\theta$ .The probability distributions that are available in the GENMOD procedure are shown in the following list. The PROC GENMOD scale parameter and the variance of Y are also shown.

Normal:
$f(y) & = & \frac{1}{\sqrt{2\pi} \sigma} \exp [ -\frac{1}2 ( \frac{y-\mu}{\sigm... ...ty \ \phi & = & \sigma^2 \ {scale} & = & \sigma \ Var(Y) & = & \sigma^2 \$
Inverse Gaussian:
$f(y) & = & \frac{1}{\sqrt{2\pi y^3} \sigma} \exp [ -\frac{1}{2y} ( \frac{y-\mu... ...phi & = & \sigma^2 \ {scale} & = & \sigma \ {Var}(Y) & = & \sigma^2 \mu^3 \$
Gamma:
$f(y) & = & \frac{1}{\Gamma(\nu)y} ( \frac{y\nu}{\mu} )^{\nu} \exp (-\frac{y \n... ...phi & = & \nu^{-1} \ {scale} & = & \nu \ {Var}(Y) & = & \frac{\mu^2}{\nu} \$
Negative Binomial:
$f(y) & = & \frac{\Gamma(y+1/k)}{\Gamma(y+1)\Gamma(1/k)} \frac{(k\mu)^k}{(1+k\mu... ...or } y = 0,1,2, ... \ {dispersion} & = & k \ {Var}(Y) & = & \mu + k\mu^2 \$
Poisson:
$f(y) & = & \frac{\mu^y e^{-\mu}}{y!} {for } y = 0,1,2, ... \ \phi & = & 1 \ {Var}(Y) & = & \mu \$
Binomial:
$f(y) & = & {n \choose r} \mu^r (1-\mu)^{n-r} {for } y=\frac{r}n, r=0,1, 2, ... ,n \ \phi & = & 1 \ {Var}(Y) & = & \frac{\mu(1-\mu)}n \$
Multinomial:
$f(y_1, y_2, ... ,y_k) & = & \frac{m!}{y_1! y_2! ... y_k!}p_1^{y_1} p_2^{y_2} ... p_k^{y_k}$

The negative binomial distribution contains a parameter k, called the negative binomial dispersion parameter. This is not the same as the generalized linear model dispersion $\phi$ , but it is an additional distribution parameter that must be estimated or set to a fixed value.

For the binomial distribution, the response is the binomial proportion Y = events/ trials. The variance function is $V(\mu) = \mu(1-\mu)$ , and the binomial trials parameter n is regarded as a weight w.

If a weight variable is present, $\phi$ is replaced with $\phi/w$ , where w is the weight variable.

PROC GENMOD works with a scale parameter that is related to the exponential family dispersion parameter $\phi$ instead of with $\phi$ itself. The scale parameters are related to the dispersion parameter as shown previously with the probability distribution definitions. Thus, the scale parameter output in the "Analysis of Parameter Estimates" table is related to the exponential family dispersion parameter. If you specify a constant scale parameter with the SCALE= option in the MODEL statement, it is also related to the exponential family dispersion parameter in the same way.

Link Function

The mean $\mu_i$ of the response in the ith observation is related to a linear predictor through a monotonic differentiable link function g.

$g(\mu_i) = {x_{i}}'{{\beta}}$

Here, x_i is a fixed known vector of explanatory variables, and ${\beta}$ is a vector of unknown parameters.

Log-Likelihood Functions

Log-likelihood functions for the distributions that are available in the procedure are parameterized in terms of the means $\mu_i$ and the dispersion parameter $\phi$ .The term y_i represents the response for the ith observation, and w_i represents the known dispersion weight. The log-likelihood functions are of the form

$L(y,{\mu}, \phi) = \sum_i \log ( f(y_i,\mu_i,\phi) )$

where the sum is over the observations. The forms of the individual contributions

$l_i = \log ( f(y_i,\mu_i ,\phi) )$

are shown in the following list; the parameterizations are expressed in terms of the mean and dispersion parameters.

Normal:
$l_i = -\frac{1}2 [ \frac{w_i(y_i-\mu_i)^2}{\phi} + \log ( \frac{\phi}{w_i} ) + \log(2 \pi) ]$
Inverse Gaussian:
$l_i = -\frac{1}2 [ \frac{w_i(y_i-\mu_i)^2}{y_i \mu^2 \phi} + \log ( \frac{\phi y_i^3}{w_i} ) + \log(2 \pi) ]$
Gamma:
$l_i = \frac{w_i}{\phi} \log ( \frac{w_i y_i}{\phi \mu_i} ) - \frac{w_i y_i}{\phi \mu_i} - \log(y_i) - \log ( \Gamma ( \frac{w_i}{\phi} ) )$
Negative Binomial:
$l_i = y\log(k\mu) - (y+1/k)\log(1+k\mu) + \log(\frac{\Gamma(y+1/k)} {\Gamma(y+1)\Gamma(1/k)})$
Poisson:
$l_i = y_i \log(\mu_i) - \mu_i$
Binomial:
l_i = [r_i log(p_i) + (n_i-r_i) log(1-p_i)]
Multinomial:
$l_i = \sum_j y_{ij}\log(\mu_{ij})$

For the binomial, multinomial, and Poisson distribution, terms involving binomial coefficients or factorials of the observed counts are dropped from the computation of the log-likelihood function since they do not affect parameter estimates or their estimated covariances.

Maximum Likelihood Fitting

The GENMOD procedure uses a ridge-stabilized Newton-Raphson algorithm to maximize the log-likelihood function $L(y,{\mu}, \phi)$ with respect to the regression parameters. By default, the procedure also produces maximum likelihood estimates of the scale parameter as defined in the "Response Probability Distributions" section for the normal, inverse Gaussian, negative binomial, and gamma distributions.

On the rth iteration, the algorithm updates the parameter vector ${{\beta}}_{r}$ with

${{\beta}}_{r+1} = {{\beta}}_{r} - H^{-1}s$

where H is the Hessian (second derivative) matrix, and s is the gradient (first derivative) vector of the log-likelihood function, both evaluated at the current value of the parameter vector. That is,

$s = [s_j] = [ \frac{\partial L}{\partial \beta_j} ]$

and

$H = [{h_{ij}}] = [ \frac{\partial^2 L} {\partial\beta_i\partial\beta_j} ]$

In some cases, the scale parameter is estimated by maximum likelihood. In these cases, elements corresponding to the scale parameter are computed and included in s and H.

If $\eta_i = {x_{i}}'{{\beta}}$ is the linear predictor for observation i and g is the link function, then $\eta_i = g(\mu_i)$ , so that $\mu_i = g^{-1}({x_{i}}'{{\beta}})$ is an estimate of the mean of the ith observation, obtained from an estimate of the parameter vector ${\beta}$ .

The gradient vector and Hessian matrix for the regression parameters are given by

$s & = & \sum_i \frac{w_i (y_i - \mu_i)x_i} {V(\mu_i) g^'(\mu_i) \phi} \ H & = & -X^'W_o X \$

where X is the design matrix, x_i is the transpose of the ith row of X, and V is the variance function. The matrix W_o is diagonal with its ith diagonal element

$w_{oi} = w_{ei} + w_i(y_i - \mu_i) \frac{V(\mu_i)g^' '(\mu_i) + V^'(\mu_i)g^'(\mu_i)} {(V(\mu_i))^2 (g^'(\mu_i))^3 \phi}$

where

$w_{ei} = \frac{w_i}{\phi V(\mu_i)(g^'(\mu_i))^2}$

The primes denote derivatives of g and V with respect to $\mu$ .The negative of H is called the observed information matrix. The expected value of W_o is a diagonal matrix W_e with diagonal values w_ei. If you replace W_o with W_e, then the negative of H is called the expected information matrix. W_e is the weight matrix for the Fisher's scoring method of fitting. Either W_o or W_e can be used in the update equation. The GENMOD procedure uses Fisher's scoring for iterations up to the number specified by the SCORING option in the MODEL statement, and it uses the observed information matrix on additional iterations.

Covariance and Correlation Matrix

The estimated covariance matrix of the parameter estimator is given by

${\Sigma} = -H^{-1}$

where H is the Hessian matrix evaluated using the parameter estimates on the last iteration. Note that the dispersion parameter, whether estimated or specified, is incorporated into H. Rows and columns corresponding to aliased parameters are not included in ${\Sigma}$ .

The correlation matrix is the normalized covariance matrix. That is, if $\sigma_{ij}$ is an element of ${\Sigma}$ , then the corresponding element of the correlation matrix is $\sigma_{ij}/\sigma_i\sigma_{j}$ ,where $\sigma_i = \sqrt{\sigma_{ii}}$ .

Goodness of Fit

Two statistics that are helpful in assessing the goodness of fit of a given generalized linear model are the scaled deviance and Pearson's chi-square statistic. For a fixed value of the dispersion parameter $\phi$ , the scaled deviance is defined to be twice the difference between the maximum achievable log likelihood and the log likelihood at the maximum likelihood estimates of the regression parameters.

Note that these statistics are not valid for GEE models.

If $l(y, {\mu})$ is the log-likelihood function expressed as a function of the predicted mean values ${\mu}$ and the vector y of response values, then the scaled deviance is defined by

$D^*(y, {\mu}) = 2(l(y,y) - l(y, {\mu}))$

For specific distributions, this can be expressed as

$D^*(y, {\mu}) = \frac{D(y, {\mu})}{\phi}$

where D is the deviance. The following table displays the deviance for each of the probability distributions available in PROC GENMOD.

Distribution	Deviance
normal	$\sum_i w_i (y_i - \mu_i)^2$
Poisson	$2\sum_i w_i [ y_i \log ( \frac{y_i}{\mu_i} ) - (y_i - \mu_i) ]$
binomial	$2\sum_i w_i m_i [ y_i \log ( \frac{y_i}{\mu_i} ) + (1-y_i) \log ( \frac{1-y_i}{1-\mu_i} ) ]$
gamma	$2\sum_i w_i [ -\log ( \frac{y_i}{\mu_i} ) + \frac{y_i - \mu_i}{\mu_i} ]$
inverse Gaussian	$\sum_i \frac{w_i(y_i - \mu_i)^2} {\mu^2_i y_i}$
multinomial	$\sum_i\sum_j w_i y_{ij}\log(\frac{y_{ij}}{p_{ij}m_i})$
negative binomial	$2\sum_i w_i[ y\log(y/\mu) - (y+1/k)\log(\frac{y+1/k}{\mu+1/k})]$

In the binomial case, y_i=r_i/m_i, where r_i is a binomial count and m_i is the binomial number of trials parameter.

In the multinomial case, y_ij refers to the observed number of occurrences of the jth category for the ith subpopulation defined by the AGGREGATE= variable, m_i is the total number in the ith subpopulation, and p_ij is the category probability.

Pearson's chi-square statistic is defined as

$X^2 = \sum_i \frac{w_i( y_i - \mu_i)^2}{V(\mu_i)}$

and the scaled Pearson's chi-square is $X^2 / \phi$ .

The scaled version of both of these statistics, under certain regularity conditions, has a limiting chi-square distribution, with degrees of freedom equal to the number of observations minus the number of parameters estimated. The scaled version can be used as an approximate guide to the goodness of fit of a given model. Use caution before applying these statistics to ensure that all the conditions for the asymptotic distributions hold. McCullagh and Nelder (1989) advise that differences in deviances for nested models can be better approximated by chi-square distributions than the deviances themselves.

In cases where the dispersion parameter is not known, an estimate can be used to obtain an approximation to the scaled deviance and Pearson's chi-square statistic. One strategy is to fit a model that contains a sufficient number of parameters so that all systematic variation is removed, estimate $\phi$ from this model, and then use this estimate in computing the scaled deviance of sub-models. The deviance or Pearson's chi-square divided by its degrees of freedom is sometimes used as an estimate of the dispersion parameter $\phi$ .For example, since the limiting chi-square distribution of the scaled deviance $D^* = D / \phi$ has n-p degrees of freedom, where n is the number of observations and p the number of parameters, equating D^* to its mean and solving for $\phi$ yields $\hat{\phi} = D/(n-p)$ .Similarly, an estimate of $\phi$ based on Pearson's chi-square X² is $\hat{\phi} = X^2/(n-p)$ .Alternatively, a maximum likelihood estimate of $\phi$ can be computed by the procedure, if desired. See the discussion in the "Type 1 Analysis" section for more on the estimation of the dispersion parameter.

Dispersion Parameter

There are several options available in PROC GENMOD for handling the exponential distribution dispersion parameter. The NOSCALE and SCALE options in the MODEL statement affect the way in which the dispersion parameter is treated. If you specify the SCALE=DEVIANCE option, the dispersion parameter is estimated by the deviance divided by its degrees of freedom. If you specify the SCALE=PEARSON option, the dispersion parameter is estimated by Pearson's chi-square statistic divided by its degrees of freedom.

Otherwise, values of the SCALE and NOSCALE options and the resultant actions are displayed in the following table.

NOSCALE	SCALE=value	Action
present	present	scale fixed at value
present	not present	scale fixed at 1
not present	not present	scale estimated by ML
not present	present	scale estimated by ML,
		starting point at value

The meaning of the scale parameter displayed in the "Analysis Of Parameter Estimates" table is different for the Gamma distribution than for the other distributions. The relation of the scale parameter as used by PROC GENMOD to the exponential family dispersion parameter $\phi$ is displayed in the following table. For the binomial and Poisson distributions, $\phi$ is the overdispersion parameter, as defined in the "Overdispersion" section, which follows.

Distribution	Scale
normal	$\sqrt{\phi}$
inverse Gaussian	$\sqrt{\phi}$
gamma	$1/\phi$
binomial	$\sqrt{\phi}$
Poisson	$\sqrt{\phi}$

In the case of the negative binomial distribution, PROC GENMOD reports the "dispersion" parameter estimated by maximum likelihood. This is the negative binomial parameter k defined in the "Response Probability Distributions" section.

Overdispersion

Overdispersion is a phenomenon that sometimes occurs in data that are modeled with the binomial or Poisson distributions. If the estimate of dispersion after fitting, as measured by the deviance or Pearson's chi-square, divided by the degrees of freedom, is not near 1, then the data may be overdispersed if the dispersion estimate is greater than 1 or underdispersed if the dispersion estimate is less than 1. A simple way to model this situation is to allow the variance functions of these distributions to have a multiplicative overdispersion factor $\phi$ .

binomial : $V(\mu) = \phi \mu(1-\mu)$
Poisson : $V(\mu) = \phi\mu$

The models are fit in the usual way, and the parameter estimates are not affected by the value of $\phi$ .The covariance matrix, however, is multiplied by $\phi$ , and the scaled deviance and log likelihoods used in likelihood ratio tests are divided by $\phi$ .The profile likelihood function used in computing confidence intervals is also divided by $\phi$ .If you specify an WEIGHT statement, $\phi$ is divided by the value of the WEIGHT variable for each observation. This has the effect of multiplying the contributions of the log-likelihood function, the gradient, and the Hessian by the value of the WEIGHT variable for each observation.

The SCALE= option in the MODEL statement enables you to specify a value of ${\sigma = \sqrt{\phi}}$ for the binomial and Poisson distributions. If you specify the SCALE=DEVIANCE option in the MODEL statement, the procedure uses the deviance divided by degrees of freedom as an estimate of $\phi$ ,and all statistics are adjusted appropriately. You can use Pearson's chi-square instead of the deviance by specifying the SCALE=PEARSON option.

The function obtained by dividing a log-likelihood function for the binomial or Poisson distribution by a dispersion parameter is not a legitimate log-likelihood function. It is an example of a quasi-likelihood function. Most of the asymptotic theory for log likelihoods also applies to quasi-likelihoods, which justifies computing standard errors and likelihood ratio statistics using quasi-likelihoods instead of proper log likelihoods. Refer to McCullagh and Nelder (1989, Chapter 9) and McCullagh (1983) for details on quasi-likelihood functions.

Although the estimate of the dispersion parameter is often used to indicate overdispersion or underdispersion, this estimate may also indicate other problems such as an incorrectly specified model or outliers in the data. You should carefully assess whether this type of model is appropriate for your data.

Chapter Contents
Previous
Next
Top