Binning

The KDE Procedure

Binning

Binning, or assigning data to discrete categories, is an effective and fast method for large data sets (Fan and Marron 1994). When the sample size n is large, direct evaluation of the kernel estimate $\hat f$ at any point would involve n kernel evaluations as shown in the preceding formulas. To evaluate the estimate at each point of a grid of size g would thus require ng kernel evaluations. When you use g = 401 in the univariate case or g = 60×60 = 3600 in the bivariate case and $n\geq1000$ , the amount of computation can be prohibitively large. With binning, however, the computational order is reduced to g, resulting in a much quicker algorithm that is practically as accurate as direct evaluation.

To bin a set of weighted univariate data X₁,X₂, ... ,X_n to a grid x₁, x₂, ... ,x_g, simply assign each sample X_i, together with its weight W_i, to the nearest grid point x_j (also called the bin center). When binning is completed, each grid point x_i has an associated number c_i, which is the sum total of all the weights that correspond to sample points that have been assigned to x_i. These c_is are known as the "bin counts."

This procedure replaces the data (X_i,W_i), i = 1,2, ... ,n with the smaller set (x_i,c_i), i = 1,2, ... ,g, and the estimation is carried out with this new data. This is so-called "simple binning," as versus the finer "linear binning" described in Wand (1993). PROC KDE uses simple binning for the sake of faster and easier implementation. Also, it is assumed that the bin centers x₁,x₂, ... ,x_g are equally spaced and in increasing order. In addition, assume for notational convenience that $\sum_{i=1}^nW_{i} = n$ and, therefore, $\sum_{i=1}^gc_{i} = n$ .

If you replace the data (X_i,W_i), i = 1,2, ... ,n with (x_i,c_i), i = 1,2, ... ,g, the weighted estimator $\hat f$ then becomes

$\hat{f}(x) = \frac{1}n\sum_{i=1}^g c_{i} \varphi_{h}(x-x_{i})$

with the same notation as used previously. To evaluate this estimator at the g points of the same grid vector grid = (x₁,x₂, ... ,x_g)' is to calculate

$\hat{f}(x_{i}) = \frac{1}n\sum_{j=1}^g c_{j} \varphi_{h}(x_{i}-x_{j})$

for i = 1,2, ... ,g. This can be rewritten as

$\hat{f}(x_{i}) = \frac{1}n\sum_{j=1}^g c_{j} \varphi_{h}(| i-j|\delta)$

where $\delta = x_{2}-x_{1}$ is the increment of the grid.

The same idea of binning works similarly with bivariate data, where you estimate $\hat f$ over the grid matrix grid = grid_X×grid_Y as follows.

$grid = [ x_{1,1} & x_{1,2} & ... & x_{1,g_{Y}} \ x_{2,1} & x_{2,2} & ... & x_{2,g_{Y}} \ \vdots \ x_{g_{X},1} & x_{g_{X},2} & ... & x_{g_{X},g_{Y}} ]$

where x_i,j = (x_i,y_i), i = 1,2, ... ,g_X, j = 1,2, ... ,g_Y, and the estimates are

$\hat{f}(x_{i,j}) = \frac{1}n\sum_{k=1}^{g_{X}} \sum_{l=1}^{g_{Y}} c_{k,l}\varphi_{h}(| i-k|\delta_{X},| j-l|\delta_{Y})$

where $\delta_{X} = x_{2}-x_{1}$ and $\delta_{Y} = y_{2}-y_{1}$ are the increments of the grid.

Chapter Contents
Previous
Next
Top