Chapter Contents
Chapter Contents
Previous
Previous
Next
Next
The KDE Procedure

Binning

Binning, or assigning data to discrete categories, is an effective and fast method for large data sets (Fan and Marron 1994). When the sample size n is large, direct evaluation of the kernel estimate \hat f at any point would involve n kernel evaluations as shown in the preceding formulas. To evaluate the estimate at each point of a grid of size g would thus require ng kernel evaluations. When you use g = 401 in the univariate case or g = 60×60 = 3600 in the bivariate case and n\geq1000, the amount of computation can be prohibitively large. With binning, however, the computational order is reduced to g, resulting in a much quicker algorithm that is practically as accurate as direct evaluation.

To bin a set of weighted univariate data X1,X2, ... ,Xn to a grid x1, x2, ... ,xg, simply assign each sample Xi, together with its weight Wi, to the nearest grid point xj (also called the bin center). When binning is completed, each grid point xi has an associated number ci, which is the sum total of all the weights that correspond to sample points that have been assigned to xi. These cis are known as the "bin counts."

This procedure replaces the data (Xi,Wi), i = 1,2, ... ,n with the smaller set (xi,ci), i = 1,2, ... ,g, and the estimation is carried out with this new data. This is so-called "simple binning," as versus the finer "linear binning" described in Wand (1993). PROC KDE uses simple binning for the sake of faster and easier implementation. Also, it is assumed that the bin centers x1,x2, ... ,xg are equally spaced and in increasing order. In addition, assume for notational convenience that \sum_{i=1}^nW_{i} = n and, therefore, \sum_{i=1}^gc_{i} = n.

If you replace the data (Xi,Wi), i = 1,2, ... ,n with (xi,ci), i = 1,2, ... ,g, the weighted estimator \hat f then becomes

\hat{f}(x) = \frac{1}n\sum_{i=1}^g c_{i}
 \varphi_{h}(x-x_{i})
with the same notation as used previously. To evaluate this estimator at the g points of the same grid vector grid = (x1,x2, ... ,xg)' is to calculate
\hat{f}(x_{i}) = \frac{1}n\sum_{j=1}^g c_{j}
 \varphi_{h}(x_{i}-x_{j})
for i = 1,2, ... ,g. This can be rewritten as
\hat{f}(x_{i}) = \frac{1}n\sum_{j=1}^g c_{j}
 \varphi_{h}(| i-j|\delta)
where \delta = x_{2}-x_{1} is the increment of the grid.

The same idea of binning works similarly with bivariate data, where you estimate \hat f over the grid matrix grid = gridX×gridY as follows.

grid = [ x_{1,1} & x_{1,2} &
  ...  & x_{1,g_{Y}} \ x_{2,1} & x_{2,2} &
  ...  & x_{2,g_{Y}} \ \vdots \ x_{g_{X},1} & x_{g_{X},2} &
  ...  & x_{g_{X},g_{Y}}
 ]
where xi,j = (xi,yi), i = 1,2, ... ,gX, j = 1,2, ... ,gY, and the estimates are
\hat{f}(x_{i,j}) = \frac{1}n\sum_{k=1}^{g_{X}}
 \sum_{l=1}^{g_{Y}}
 c_{k,l}\varphi_{h}(| i-k|\delta_{X},| j-l|\delta_{Y})
where \delta_{X} = x_{2}-x_{1} and \delta_{Y} = y_{2}-y_{1} are the increments of the grid.

Chapter Contents
Chapter Contents
Previous
Previous
Next
Next
Top
Top

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.