The TPSPLINE Procedure

Computational Formulas

The theoretical foundations for the thin-plate smoothing spline are described in Duchon (1976, 1977) and Meinguet (1979). Further results and applications are given in Wahba and Wendelberger (1980), Hutchinson and Bischof (1983), and Seaman and Hutchinson (1985).

Suppose that \mathcal{H}_m is a space of functions whose partial derivatives of total order m are in L_2(E^d), where E^d is the domain of x.

Now, consider the data model

y_i = f(x_{i1}, \ldots, x_{id}) + \epsilon_i, \quad i = 1, \ldots, n

where f \in \mathcal{H}_m.

Using the notation from the section "The Penalized Least Squares Estimate", for a fixed \lambda, estimate f by minimizing the penalized least squares function

\frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i) - z_i\beta)^2 + \lambda J_m(f)

There are several ways to define J_m(f). For the thin-plate smoothing spline, with x of dimension d, define J_m(f) as

J_m(f) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \sum \frac{m!}{\alpha_1! \cdots \alpha_d!} \left[ \frac{\partial^m f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}} \right]^2 dx_1 \cdots dx_d

where the sum is over all nonnegative integers \alpha_1, \ldots, \alpha_d with \sum_i \alpha_i = m.

When d=2 and m=2, J_m(f) is as follows:

J_2(f) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left( \left[ \frac{\partial^2 f}{\partial x_1^2} \right]^2 + 2 \left[ \frac{\partial^2 f}{\partial x_1 \partial x_2} \right]^2 + \left[ \frac{\partial^2 f}{\partial x_2^2} \right]^2 \right) dx_1 \, dx_2
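
The coefficient 2 on the mixed-derivative term follows from the multinomial weight in the general definition: with d=2 and m=2, the admissible indices are (\alpha_1, \alpha_2) = (2,0), (1,1), (0,2), with weights

\frac{2!}{2!\,0!} = 1, \quad \frac{2!}{1!\,1!} = 2, \quad \frac{2!}{0!\,2!} = 1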

In general, m and d must satisfy the condition that 2m - d > 0. For the sake of simplicity, the formulas and equations that follow assume m=2. Refer to Wahba (1990) and Bates et al. (1987) for more details.

Duchon (1976) showed that f_\lambda can be represented as

f_\lambda(x_i) = \theta_0 + \sum_{j=1}^d \theta_j x_{ij} + \sum_{j=1}^n \delta_j E_2(x_i - x_j)

where E_2(s) = \frac{1}{2^3\pi} \|s\|^2 \ln(\|s\|).
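
For illustration, the following is a minimal NumPy sketch of this basis function; the name E2 and the convention E_2(0) = 0 (the limiting value as \|s\| \rightarrow 0) are ours, not part of PROC TPSPLINE.

    import numpy as np

    def E2(s):
        # E_2(s) = ||s||^2 ln(||s||) / (8*pi); the limit at s = 0 is 0
        r = np.linalg.norm(s)
        return 0.0 if r == 0.0 else r**2 * np.log(r) / (8.0 * np.pi)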

If you define the matrices K and T by (K)_{ij} = E_2(x_i - x_j) and (T)_{ij} = (x_{ij}), the goal is to find coefficients \beta, \theta, and \delta that minimize

S_\lambda(\beta, \theta, \delta) = \frac{1}{n} \|y - T\theta - K\delta - Z\beta\|^2 + \lambda \delta^T K \delta

A unique solution is guaranteed if the matrix T is of full rank and \delta^T K \delta \ge 0.
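
To make these matrices concrete, here is a sketch that builds K and T from an (n, d) array of design points, reusing the E2 function from the sketch above. It prepends a column of ones to T so that the intercept \theta_0 is carried with the linear terms; that detail is our assumption, since the text writes only (T)_{ij} = (x_{ij}).

    import numpy as np

    def design_matrices(x):
        # x: (n, d) array of design points
        n = x.shape[0]
        K = np.array([[E2(x[i] - x[j]) for j in range(n)] for i in range(n)])
        T = np.hstack([np.ones((n, 1)), x])  # leading column of ones for theta_0
        return K, T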

If \alpha = \begin{pmatrix} \theta \\ \beta \end{pmatrix} and X = (T : Z), the expression for S_\lambda becomes

\frac{1}{n} \|y - X\alpha - K\delta\|^2 + \lambda \delta^T K \delta

The coefficients \alpha and \delta can be obtained by solving

(K + n\lambda I_n)\delta + X\alpha = y
X^T \delta = 0

To compute \alpha and \delta, let the QR decomposition of X be

X = (Q_1 : Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix}

where (Q_1 : Q_2) is an orthogonal matrix and R is upper triangular, with X^T Q_2 = 0 (Dongarra et al. 1979).

Since X^T \delta = 0, \delta must be in the column space of Q_2. Therefore, \delta can be expressed as \delta = Q_2\gamma for a vector \gamma. Substituting \delta = Q_2\gamma into the preceding equation and multiplying through by Q_2^T gives

Q_2^T (K + n\lambda I_n) Q_2\gamma = Q_2^T y

or

\delta = Q_2\gamma = Q_2 [Q_2^T (K + n\lambda I_n) Q_2]^{-1} Q_2^T y

The coefficient \alpha can be obtained by solving

R\alpha = Q_1^T [y - (K + n\lambda I_n)\delta]
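
As an illustration of this solution path, the following minimal NumPy sketch solves for \alpha and \delta exactly as derived above; it is our sketch, not the PROC TPSPLINE implementation, and it assumes K, X, and y are already formed.

    import numpy as np

    def solve_tpspline(K, X, y, lam):
        # Solve (K + n*lam*I)delta + X*alpha = y subject to X'delta = 0
        n, p = X.shape
        Q, R = np.linalg.qr(X, mode="complete")  # Q = (Q1 : Q2); R stacked over zeros
        Q1, Q2 = Q[:, :p], Q[:, p:]
        M = K + n * lam * np.eye(n)
        # gamma solves Q2'(K + n*lam*I)Q2 gamma = Q2'y; then delta = Q2 gamma
        gamma = np.linalg.solve(Q2.T @ M @ Q2, Q2.T @ y)
        delta = Q2 @ gamma
        # back-solve R alpha = Q1'[y - (K + n*lam*I)delta]
        alpha = np.linalg.solve(R[:p, :], Q1.T @ (y - M @ delta))
        return alpha, delta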

The influence matrix A(\lambda) is defined as

\hat{y} = A(\lambda) y

and has the form

A(\lambda) = I_n - n\lambda Q_2 [Q_2^T (K + n\lambda I_n) Q_2]^{-1} Q_2^T

Similar to the regression case, if you consider the trace of A(\lambda) as the degrees of freedom for the information signal and the trace of (I_n - A(\lambda)) as the degrees of freedom for the noise component, the estimate of \sigma^2 can be represented as

\hat{\sigma}^2 = \frac{\mathrm{RSS}(\lambda)}{\mathrm{tr}(I_n - A(\lambda))}

where RSS(\lambda) is the residual sum of squares. Theoretical properties of these estimates have not yet been published. However, good numerical results in simulation studies have been described by several authors. For more information, refer to O'Sullivan and Wong (1987), Nychka (1986a, 1986b, and 1988), and Hall and Titterington (1987).
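
In the same spirit, here is a sketch that forms A(\lambda) explicitly and computes \hat{\sigma}^2; forming the full n x n matrix and taking an explicit inverse is fine for illustration, although a careful implementation would avoid both.

    import numpy as np

    def influence_matrix(K, X, lam):
        # A(lambda) = I_n - n*lam*Q2 [Q2'(K + n*lam*I_n)Q2]^{-1} Q2'
        n, p = X.shape
        Q, _ = np.linalg.qr(X, mode="complete")
        Q2 = Q[:, p:]
        M = K + n * lam * np.eye(n)
        return np.eye(n) - n * lam * Q2 @ np.linalg.inv(Q2.T @ M @ Q2) @ Q2.T

    def sigma2_hat(A, y):
        # RSS(lambda) / tr(I_n - A(lambda))
        r = y - A @ y
        return (r @ r) / np.trace(np.eye(len(y)) - A)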

Confidence Intervals

Viewing the spline model as a Bayesian model, Wahba (1983) proposed Bayesian confidence intervals for smoothing spline estimates as follows:

\hat{f}_\lambda(x_i) \pm z_{\alpha/2} \sqrt{\hat{\sigma}^2 a_{ii}(\lambda)}

where a_{ii}(\lambda) is the ith diagonal element of the A(\lambda) matrix and z_{\alpha/2} is the \alpha/2 point of the normal distribution. The confidence intervals are interpreted as intervals "across the function" as opposed to point-wise intervals.
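
A minimal sketch of these intervals, reusing the influence_matrix and sigma2_hat helpers from the sketches above; scipy.stats.norm supplies the normal quantile.

    import numpy as np
    from scipy.stats import norm

    def bayesian_intervals(A, y, sigma2, alpha=0.05):
        # f_hat(x_i) +/- z_{alpha/2} sqrt(sigma2 * a_ii(lambda))
        y_hat = A @ y
        half = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(sigma2 * np.diag(A))
        return y_hat - half, y_hat + half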

Suppose that you fit a spline estimate to experimental data that consist of a true function f and a random error term, \epsilon_i. In repeated experiments, it is likely that about 100(1-\alpha)\% of the confidence intervals cover the corresponding true values, although some values are covered every time and other values are not covered by the confidence intervals most of the time. This effect is more pronounced when the true curve or surface has small regions of particularly rapid change.

Smoothing Parameter

The quantity \lambda is called the smoothing parameter, which controls the balance between the goodness of fit and the smoothness of the final estimate.

A large \lambda heavily penalizes the mth derivative of the function, forcing f^{(m)} toward 0; in the limit, the final estimate satisfies f^{(m)}(x) = 0. A small \lambda places less of a penalty on rapid change in f, resulting in an estimate that tends to interpolate the data points.

The smoothing parameter greatly affects the analysis, and it should be selected with care. One method is to perform several analyses with different values for \lambda and compare the resulting final estimates.

A more objective way to select the smoothing parameter \lambda is to use the "leave-out-one" cross validation function, which is an approximation of the predicted mean squared error. A generalized version of the leave-out-one cross validation function was proposed by Wahba (1990) and is easy to calculate. This Generalized Cross Validation (GCV) function V(\lambda) is defined as

V(\lambda) = \frac{(1/n) \|(I - A(\lambda)) y\|^2}{[(1/n) \mathrm{tr}(I - A(\lambda))]^2}
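
A sketch of V(\lambda), reusing influence_matrix from the earlier sketch; in practice you would evaluate it over a logarithmic grid of candidate \lambda values and take the minimizer.

    import numpy as np

    def gcv(K, X, y, lam):
        # V(lambda) = (1/n)||(I - A)y||^2 / [(1/n) tr(I - A)]^2
        n = len(y)
        A = influence_matrix(K, X, lam)  # from the sketch above
        r = y - A @ y
        return (r @ r / n) / (np.trace(np.eye(n) - A) / n) ** 2

For example, lam_best = min((gcv(K, X, y, 10.0**e), 10.0**e) for e in range(-8, 1))[1] scans powers of ten and keeps the \lambda with the smallest GCV value.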

The justification for using the GCV function to select \lambda relies on asymptotic theory. Thus, you cannot expect good results for very small sample sizes or when there is not enough information in the data to separate the information signal from the noise component. Simulation studies suggest that for independent and identically distributed Gaussian noise, you can obtain reliable estimates of \lambda for n greater than 25 or 30. Note that, even for large values of n (say n \ge 50), Monte Carlo simulations may produce a small percentage of unwarranted extreme estimates in which \hat{\lambda} = 0 or \hat{\lambda} = \infty (Wahba 1983). Generally, if \sigma^2 is known to within an order of magnitude, the occasional extreme case can be readily identified. As n gets larger, the effect becomes weaker.

The GCV function is fairly robust against nonhomogeneity of variances and non-Gaussian errors (Villalobos and Wahba 1987). Andrews (1988) has provided favorable theoretical results when variances are unequal. However, this selection method is likely to give unsatisfactory results when the errors are highly correlated.

The GCV value may be suspect when \lambda is extremely small because computed values may become indistinguishable from zero. In practice, calculations with \lambda = 0 or \lambda near 0 can cause numerical instabilities that result in an unsatisfactory solution. Simulation studies have shown that a \lambda with \log_{10}(n\lambda) < -8 is small enough that the final estimate based on this \lambda almost interpolates the data points. A GCV value based on a \lambda with \log_{10}(n\lambda) \le -8 may not be accurate.
