Troubleshooting Convergence Problems

The MODEL Procedure

Troubleshooting Convergence Problems

As with any nonlinear estimation routine, there is no guarantee that the estimation will be successful for a given model and data. If the equations are linear with respect to the parameters, the parameter estimates always converge in one iteration. The methods that iterate the S matrix must iterate further for the S matrix to converge. Nonlinear models may not necessarily converge.

Convergence can be expected only with fully identified parameters, adequate data, and starting values sufficiently close to solution estimates.

Convergence and the rate of convergence may depend primarily on the choice of starting values for the estimates. This does not mean that a great deal of effort should be invested in choosing starting values. First, try the default values. If the estimation fails with these starting values, examine the model and data and re-run the estimation using reasonable starting values. It is usually not necessary that the starting values be very good, just that they not be very bad; choose values that seem plausible for the model and data.

An Example of Requiring Starting Values

Suppose you want to regress a variable Y on a variable X assuming that the variables are related by the following nonlinear equation:

$y = a + b x^c + {\epsilon}$

In this equation, Y is linearly related to a power transformation of X. The unknown parameters are a, b, and c. ${\epsilon}$ is an unobserved random error. Some simulated data was generated using the following SAS statements. In this simulation, a=10, b=2, and the use of the SQRT function corresponds to c=.5.

   data test;
     do i = 1 to 20; 
        x = 5 * ranuni(1234); 
        y = 10 + 2 * sqrt(x) + .5 * rannor(2345);
        output;
        end;
   run;

The following statements specify the model and give descriptive labels to the model parameters. Then the FIT statement attempts to estimate a, b, and c using the default starting value .0001.

   proc model data=test;
      y = a + b * x ** c;
      label a = "Intercept"
            b = "Coefficient of Transformed X"
            c = "Power Transformation Parameter";
      fit y;
   run;

PROC MODEL prints model summary and estimation problem summary reports and then prints the output shown in Figure 14.17.

The MODEL Procedure

OLS Estimation

NOTE: The iteration limit is exceeded for OLS.

ERROR: The parameter estimates failed to converge for OLS after 100 iterations using CONVERGE=0.001 as the convergence criteria.

The MODEL Procedure

OLS Estimation

	Iteration	N Obs	R	Objective	N Subit	a	b	c
OLS	100	20	0.9627	3.9678	2	137.3844	-126.536	-0.00213

Gauss Method Parameter Change Vector
a	b	c
-69367.57	69366.51	-1.16

NOTE: The parameter estimation is abandoned. Check your model and data. If the model is correct and the input data are appropriate, try rerunning the parameter estimation using different starting values for the parameter estimates.

PROC MODEL continues as if the parameter estimates had converged.

Figure 14.17: Diagnostics for Convergence Failure

By using the default starting values, PROC MODEL was unable to take even the first step in iterating to the solution. The change in the parameters that the Gauss-Newton method computes is very extreme and makes the objective values worse instead of better. Even when this step is shortened by a factor of a million, the objective function is still worse, and PROC MODEL is unable to estimate the model parameters.

The problem is caused by the starting value of C. Using the default starting value C=.0001, the first iteration attempts to compute better values of A and B by what is, in effect, a linear regression of Y on the 10,000th root of X, which is almost the same as the constant 1. Thus the matrix that is inverted to compute the changes is nearly singular and affects the accuracy of the computed parameter changes.

This is also illustrated by the next part of the output, which displays collinearity diagnostics for the crossproducts matrix of the partial derivatives with respect to the parameters, shown in Figure 14.18.

The MODEL Procedure

OLS Estimation

Collinearity Diagnostics
Number	Eigenvalue	Condition Number	Proportion of Variation
Number	Eigenvalue	Condition Number	a	b	c
1	2.376793	1.0000	0.0000	0.0000	0.0000
2	0.623207	1.9529	0.0000	0.0000	0.0000
3	1.684616E-12	1187805	1.0000	1.0000	1.0000

Figure 14.18: Collinearity Diagnostics

This output shows that the matrix is singular and that the partials of A, B, and C with respect to the residual are collinear at the point ( 0.0001, 0.0001, 0.0001 ) in the parameter space. See the section "Linear Dependencies" for a full explanation of the collinearity diagnostics.

The MODEL procedure next prints the note shown in Figure 14.19, which suggests that you try different starting values.

The MODEL Procedure

OLS Estimation

PROC MODEL continues as if the parameter estimates had converged.

Figure 14.19: Estimation Failure Note

PROC MODEL then produces the usual printout of results for the nonconverged parameter values. The estimation summary is shown in Figure 14.20. The heading includes the reminder "(Not Converged)."

The MODEL Procedure

OLS Estimation

Collinearity Diagnostics
Number	Eigenvalue	Condition Number	Proportion of Variation
Number	Eigenvalue	Condition Number	a	b	c
1	2.376793	1.0000	0.0000	0.0000	0.0000
2	0.623207	1.9529	0.0000	0.0000	0.0000
3	1.684616E-12	1187805	1.0000	1.0000	1.0000

The MODEL Procedure

OLS Estimation Summary (Not Converged)

Minimization Summary
Parameters Estimated	3
Method	Gauss
Iterations	100
Subiterations	239
Average Subiterations	2.39

Final Convergence Criteria
R	0.962666
PPC(b)	548.1977
RPC(b)	540.4224
Object	2.633E-6
Trace(S)	4.667947
Objective Value	3.967755

Observations Processed
Read	20
Solved	20

Figure 14.20: Nonconverged Estimation Summary

The nonconverged estimation results are shown in Figure 14.21.

The MODEL Procedure

Nonlinear OLS Summary of Residual Errors (Not Converged)
Equation	DF Model	DF Error	SSE	MSE	Root MSE	R-Square	Adj R-Sq
y	3	17	79.3551	4.6679	2.1605	-1.6812	-1.9966

Nonlinear OLS Parameter Estimates (Not Converged)
Parameter	Estimate	Approx Std Err	t Value	Approx Pr > \|t\|	Label
a	137.3844	263342	0.00	0.9996	Intercept
b	-126.536	263342	-0.00	0.9996	Coefficient of Transformed X
c	-0.00213	4.4371	-0.00	0.9996	Power Transformation Parameter

Figure 14.21: Nonconverged Results

Note that the R² statistic is negative. An R² < 0 results when the residual mean square error for the model is larger than the variance of the dependent variable. Negative R² statistics may be produced when either the parameter estimates fail to converge correctly, as in this case, or when the correctly estimated model fits the data very poorly.

Controlling Starting Values

To fit the preceding model you must specify a better starting value for C. Avoid starting values of C that are either very large or close to 0. For starting values of A and B, you can either specify values, use the default, or have PROC MODEL fit starting values for them conditional on the starting value for C.

Starting values are specified with the START= option of the FIT statement or on a PARMS statement. For example, the following statements estimate the model parameters using the starting values A=.0001, B=.0001, and C=5.

   proc model data=test;
      y = a + b * x ** c;
      label a = "Intercept"
            b = "Coefficient of Transformed X"
            c = "Power Transformation Parameter";
      fit y start=(c=5);
   run;

Using these starting values, the estimates converge in 16 iterations. The results are shown in Figure 14.22. Note that since the START= option explicitly declares parameters, the parameter C is placed first in the table.

The MODEL Procedure

Nonlinear OLS Summary of Residual Errors
Equation	DF Model	DF Error	SSE	MSE	Root MSE	R-Square	Adj R-Sq
y	3	17	5.7359	0.3374	0.5809	0.8062	0.7834

Nonlinear OLS Parameter Estimates
Parameter	Estimate	Approx Std Err	t Value	Approx Pr > \|t\|	Label
c	0.327079	0.2892	1.13	0.2738	Power Transformation Parameter
a	8.384311	3.3775	2.48	0.0238	Intercept
b	3.505391	3.4858	1.01	0.3287	Coefficient of Transformed X

Figure 14.22: Converged Results

Using the STARTITER Option

PROC MODEL can compute starting values for some parameters conditional on starting values you specify for the other parameters. You supply starting values for some parameters and specify the STARTITER option on the FIT statement.

For example, the following statements set C to 1 and compute starting values for A and B by estimating these parameters conditional on the fixed value of C. With C=1 this is equivalent to computing A and B by linear regression on X. A PARMS statement is used to declare the parameters in alphabetical order. The ITPRINT option is used to print the parameter values at each iteration.

   proc model data=test;
      parms a b c;
      y = a + b * x ** c;
      label a = "Intercept"
            b = "Coefficient of Transformed X"
            c = "Power Transformation Parameter";
      fit y start=(c=1) / startiter itprint;
   run;

With better starting values, the estimates converge in only 5 iterations. Counting the 2 iterations required to compute the starting values for A and B, this is 5 fewer than the 12 iterations required without the STARTITER option. The iteration history listing is shown in Figure 14.23.

The MODEL Procedure

OLS Estimation

	Iteration	N Obs	R	Objective	N Subit	a	b	c
GRID	0	20	0.9970	161.9	0	0.00010	0.00010	5.00000
GRID	1	20	0.0000	0.9675	0	12.29508	0.00108	5.00000

	Iteration	N Obs	R	Objective	N Subit	a	b	c
OLS	0	20	0.6551	0.9675	0	12.29508	0.00108	5.00000
OLS	1	20	0.6882	0.9558	4	12.26426	0.00201	4.44013
OLS	2	20	0.6960	0.9490	4	12.25554	0.00251	4.28262
OLS	3	20	0.7058	0.9428	2	12.24487	0.00323	4.09977
OLS	4	20	0.7177	0.9380	2	12.23186	0.00430	3.89040
OLS	5	20	0.7317	0.9354	2	12.21610	0.00592	3.65450
OLS	6	20	0.7376	0.9289	3	12.20663	0.00715	3.52417
OLS	7	20	0.7445	0.9223	2	12.19502	0.00887	3.37407
OLS	8	20	0.7524	0.9162	2	12.18085	0.01130	3.20393
OLS	9	20	0.7613	0.9106	2	12.16366	0.01477	3.01460
OLS	10	20	0.7705	0.9058	2	12.14298	0.01975	2.80839
OLS	11	20	0.7797	0.9015	2	12.11827	0.02690	2.58933
OLS	12	20	0.7880	0.8971	2	12.08900	0.03712	2.36306
OLS	13	20	0.7947	0.8916	2	12.05460	0.05152	2.13650
OLS	14	20	0.7993	0.8835	2	12.01449	0.07139	1.91695
OLS	15	20	0.8015	0.8717	2	11.96803	0.09808	1.71101
OLS	16	20	0.8013	0.8551	2	11.91459	0.13284	1.52361
OLS	17	20	0.7987	0.8335	2	11.85359	0.17666	1.35745
OLS	18	20	0.8026	0.8311	1	11.71551	0.28373	1.06872
OLS	19	20	0.7945	0.7935	2	11.57666	0.40366	0.89662
OLS	20	20	0.7872	0.7607	1	11.29346	0.65999	0.67059
OLS	21	20	0.7632	0.6885	1	10.81372	1.11483	0.48842
OLS	22	20	0.6976	0.5587	0	9.54889	2.34556	0.30461
OLS	23	20	0.0108	0.2868	0	8.44333	3.44826	0.33232
OLS	24	20	0.0008	0.2868	0	8.39438	3.49500	0.32790

NOTE: At OLS Iteration 24 CONVERGE=0.001 Criteria Met.

Figure 14.23: ITPRINT Listing

The results produced in this case are almost the same as the results shown in Figure 14.22, except that the PARMS statement causes the Parameter Estimates table to be ordered A, B, C instead of C, A, B. They are not exactly the same because the different starting values caused the iterations to converge at a slightly different place. This effect is controlled by changing the convergence criterion with the CONVERGE= option.

By default, the STARTITER option performs one iteration to find starting values for the parameters not given values. In this case the model is linear in A and B, so only one iteration is needed. If A or B were nonlinear, you could specify more than one "starting values" iteration by specifying a number for the STARTITER= option.

Finding Starting Values by Grid Search

PROC MODEL can try various combinations of parameter values and use the combination producing the smallest objective function value as starting values. (For OLS the objective function is the residual mean square.) This is known as a preliminary grid search. You can combine the STARTITER option with a grid search.

For example, the following statements try 5 different starting values for C: 10, 5, 2.5, -2.5, -5. For each value of C, values for A and B are estimated. The combination of A, B, and C values producing the smallest residual mean square is then used to start the iterative process.

   proc model data=test;
      parms a b c;
      y = a + b * x ** c;
      label a = "Intercept"
            b = "Coefficient of Transformed X"
            c = "Power Transformation Parameter";
      fit y start=(c=10 5 2.5 -2.5 -5) / startiter itprint;
   run;

The iteration history listing is shown in Figure 14.24. Using the best starting values found by the grid search, the OLS estimation only requires 2 iterations. However, since the grid search required 10 iterations, the total iterations in this case is 12.

The MODEL Procedure

OLS Estimation

	Iteration	N Obs	R	Objective	N Subit	a	b	c
GRID	0	20	1.0000	26815.5	0	0.00010	0.00010	10.00000
GRID	1	20	0.0000	1.2193	0	12.51792	0.00000	10.00000
GRID	0	20	0.6012	1.5151	0	12.51792	0.00000	5.00000
GRID	1	20	0.0000	0.9675	0	12.29508	0.00108	5.00000
GRID	0	20	0.7804	1.6091	0	12.29508	0.00108	2.50000
GRID	1	20	0.0000	0.6290	0	11.87327	0.06372	2.50000
GRID	0	20	0.8779	4.1604	0	11.87327	0.06372	-2.50000
GRID	1	20	0.0000	0.9542	0	12.92455	-0.04700	-2.50000
GRID	0	20	0.9998	2776.1	0	12.92455	-0.04700	-5.00000
GRID	1	20	0.0000	1.0450	0	12.86129	-0.00060	-5.00000

	Iteration	N Obs	R	Objective	N Subit	a	b	c
OLS	0	20	0.6685	0.6290	0	11.87327	0.06372	2.50000
OLS	1	20	0.6649	0.5871	3	11.79268	0.10083	2.11710
OLS	2	20	0.6713	0.5740	2	11.71445	0.14901	1.81658
OLS	3	20	0.6726	0.5621	2	11.63772	0.20595	1.58705
OLS	4	20	0.6678	0.5471	2	11.56098	0.26987	1.40903
OLS	5	20	0.6587	0.5295	2	11.48317	0.33953	1.26760
OLS	6	20	0.6605	0.5235	1	11.32436	0.48846	1.03784
OLS	7	20	0.6434	0.4997	2	11.18704	0.62475	0.90793
OLS	8	20	0.6294	0.4805	1	10.93520	0.87965	0.73319
OLS	9	20	0.6031	0.4530	1	10.55670	1.26879	0.57385
OLS	10	20	0.6052	0.4526	0	9.62442	2.23114	0.36146
OLS	11	20	0.1652	0.2948	0	8.56683	3.31774	0.32417
OLS	12	20	0.0008	0.2868	0	8.38015	3.50974	0.32664

NOTE: At OLS Iteration 12 CONVERGE=0.001 Criteria Met.

Figure 14.24: ITPRINT Listing

Because no initial values for A or B were provided in the PARAMETERS statement or were read in with a PARMSDATA= or ESTDATA= option, A and B were given the default value of 0.0001 for the first iteration. At the second grid point, C=5, the values of A and B obtained from the previous iterations are used for the initial iteration. If initial values are provided for parameters, the parameters start at those initial values at each grid point.

Guessing Starting Values from the Logic of the Model

Example 14.1 of the logistic growth curve model of the U.S. population illustrates the need for reasonable starting values. This model can be written

pop = [a/(1+exp(b-c(t-1790)))]

where t is time in years. The model is estimated using decennial census data of the U.S. population in millions. If this simple but highly nonlinear model is estimated using the default starting values, the estimation fails to converge.

To find reasonable starting values, first consider the meaning of a and c. Taking the limit as time increases, a is the limiting or maximum possible population. So, as a starting value for a, several times the most recent population known can be used, for example, one billion (1000 million).

Dividing the time derivative by the function to find the growth rate and taking the limit as t moves into the past, you can determine that c is the initial growth rate. You can examine the data and compute an estimate of the growth rate for the first few decades, or you can pick a number that sounds like a plausible population growth rate figure, such as 2%.

To find a starting value for b, let t equal the base year used, 1790, which causes c to drop out of the formula for that year, and then solve for the value of b that is consistent with the known population in 1790 and with the starting value of a. This yields b = ln(a/3.9-1) or about 5.5, where a is 1000 and 3.9 is roughly the population for 1790 given in the data. The estimates converge using these starting values.

Convergence Problems

When estimating nonlinear models, you may encounter some of the following convergence problems.

Unable to Improve

The optimization algorithm may be unable to find a step that improves the objective function. If this happens in the Gauss-Newton method, the step size is halved to find a change vector for which the objective improves. In the Marquardt method, ${\lambda}$ will be increased to find a change vector for which the objective improves. If, after MAXSUBITER= step-size halvings or increases in ${\lambda}$ , the change vector still does not produce a better objective value, the iterations are stopped and an error message is printed.

Failure of the algorithm to improve the objective value can be caused by a CONVERGE= value that is too small. Look at the convergence measures reported at the point of failure. If the estimates appear to be approximately converged, you can accept the NOT CONVERGED results reported, or you can try re-running the FIT task with a larger CONVERGE= value.

If the procedure fails to converge because it is unable to find a change vector that improves the objective value, check your model and data to ensure that all parameters are identified and data values are reasonably scaled. Then, re-run the model with different starting values. Also, consider using the Marquardt method if Gauss-Newton fails; the Gauss-Newton method can get into trouble if the Jacobian matrix is nearly singular or ill-conditioned. Keep in mind that a nonlinear model may be well-identified and well-conditioned for parameter values close to the solution values but unidentified or numerically ill-conditioned for other parameter values. The choice of starting values can make a big difference.

Nonconvergence

The estimates may diverge into areas where the program overflows or the estimates may go into areas where function values are illegal or too badly scaled for accurate calculation. The estimation may also take steps that are too small or that make only marginal improvement in the objective function and, thus, fail to converge within the iteration limit.

When the estimates fail to converge, collinearity diagnostics for the Jacobian crossproducts matrix are printed if there are 20 or fewer parameters estimated. See "Linear Dependencies" later in this section for an explanation of these diagnostics.

Inadequate Convergence Criterion

If convergence is obtained, the resulting estimates will only approximate a minimum point of the objective function. The statistical validity of the results is based on the exact minimization of the objective function, and for nonlinear models the quality of the results depends on the accuracy of the approximation of the minimum. This is controlled by the convergence criterion used.

There are many nonlinear functions for which the objective function is quite flat in a large region around the minimum point so that many quite different parameter vectors may satisfy a weak convergence criterion. By using different starting values, different convergence criteria, or different minimization methods, you can produce very different estimates for such models.

You can guard against this by running the estimation with different starting values and different convergence criteria and checking that the estimates produced are essentially the same. If they are not, use a smaller CONVERGE= value.

Local Minimum

You may have converged to a local minimum rather than a global one. This problem is difficult to detect because the procedure will appear to have succeeded. You can guard against this by running the estimation with different starting values or with a different minimization technique. The START= option can be used to automatically perform a grid search to aid in the search for a global minimum.

Discontinuities

The computational methods assume that the model is a continuous and smooth function of the parameters. If this is not the case, the methods may not work.

If the model equations or their derivatives contain discontinuities, the estimation will usually succeed, provided that the final parameter estimates lie in a continuous interval and that the iterations do not produce parameter values at points of discontinuity or parameter values that try to cross asymptotes.

One common case of discontinuities causing estimation failure is that of an asymptotic discontinuity between the final estimates and the initial values. For example, consider the following model, which is basically linear but is written with one parameter in reciprocal form:

   y = a + b * x1 + x2 / c;

By placing the parameter C in the denominator, a singularity is introduced into the parameter space at C=0. This is not necessarily a problem, but if the correct estimate of C is negative while the starting value is positive (or vice versa), the asymptotic discontinuity at 0 will lie between the estimate and the starting value. This means that the iterations have to pass through the singularity to get to the correct estimates. The situation is shown in Figure 14.25.

Figure 14.25: Asymptotic Discontinuity

Because of the incorrect sign of the starting value, the C estimate goes off towards positive infinity in a vain effort to get past the asymptote and onto the correct arm of the hyperbola. As the computer is required to work with ever closer approximations to infinity, the numerical calculations break down and an "objective function was not improved" convergence failure message is printed. At this point, the iterations terminate with an extremely large positive value for C. When the sign of the starting value for C is changed, the estimates converge quickly to the correct values.

Linear Dependencies

In some cases, the Jacobian matrix may not be of full rank; parameters may not be fully identified for the current parameter values with the current data. When linear dependencies occur among the derivatives of the model, some parameters appear with a standard error of 0 and with the word BIASED printed in place of the t statistic. When this happens, collinearity diagnostics for the Jacobian crossproducts matrix are printed if the DETAILS option is specified and there are twenty or fewer parameters estimated. Collinearity diagnostics are also printed out automatically when a minimization method fails, or when the COLLIN option is specified.

For each parameter, the proportion of the variance of the estimate accounted for by each principal component is printed. The principal components are constructed from the eigenvalues and eigenvectors of the correlation matrix (scaled covariance matrix). When collinearity exists, a principal component is associated with proportion of the variance of more than one parameter. The numbers reported are proportions so they will remain between 0 and 1. If two or more parameters have large proportion values associated with the same principle component, then two problems can occur: the computation of the parameter estimates are slow or nonconvergent; and the parameter estimates have inflated variances (Belsley 1980, p. 105-117).

For example, the following cubic model is fit to a quadratic data set:

   proc model data=test3;
      exogenous x1 ;
      parms b1 a1 c1 ;
      y1 = a1 * x1 + b1 * x1 * x1  + c1 * x1 * x1 *x1;
      fit y1/ collin ;
   run;

The collinearity diagnostics are shown in Figure 14.26.

The MODEL Procedure

Collinearity Diagnostics
Number	Eigenvalue	Condition Number	Proportion of Variation
Number	Eigenvalue	Condition Number	b1	a1	c1
1	2.942920	1.0000	0.0001	0.0004	0.0002
2	0.056638	7.2084	0.0001	0.0357	0.0148
3	0.000442	81.5801	0.9999	0.9639	0.9850

Figure 14.26: Collinearity Diagnostics

Notice that the proportions associated with the smallest eigenvalue are almost 1. For this model, removing any of the parameters will decrease the variances of the remaining parameters.

In many models the collinearity might not be clear cut. Collinearity is not necessarily something you remove. A model may need to be reformulated to remove the redundant parameterization or the limitations on the estimatability of the model can be accepted.

Collinearity diagnostics are also useful when an estimation does not converge. The diagnostics provide insight into the numerical problems and can suggest which parameters need better starting values. These diagnostics are based on the approach of Belsley, Kuh, and Welsch (1980).

Chapter Contents
Previous
Next
Top