Computing Quantiles

The STDIZE Procedure

PROC STDIZE offers two methods for computing quantiles: the one-pass approach and the order-statistics approach (like that used in the UNIVARIATE procedure).

The one-pass approach used in PROC STDIZE modifies the P² algorithm for histograms proposed by Jain and Chlamtac (1985). The primary difference comes from the movement of markers. The one-pass method allows a marker to move to the right (or left) by more than one position (to the largest possible integer) as long as it does not result in two markers being in the same position. The modification is necessary in order to incorporate the FREQ variable.
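For orientation, here is a minimal Python sketch of the unmodified, textbook P² estimator for a single quantile. It is not the PROC STDIZE implementation: it has no FREQ handling, and it moves each interior marker by at most one position per observation, which is the restriction that the modification described above relaxes. The class name `P2Quantile` and its methods are hypothetical, chosen only for illustration.

```python
import random


class P2Quantile:
    """Textbook P-square estimator for one quantile p, with 0 < p < 1."""

    def __init__(self, p):
        self.q = []                    # marker heights (quantile estimates)
        self.n = [1, 2, 3, 4, 5]       # actual marker positions
        self.want = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]  # desired positions
        self.step = [0, p / 2, p, (1 + p) / 2, 1]  # growth of desired positions

    def add(self, x):
        q, n = self.q, self.n
        if len(q) < 5:                 # the first five observations seed the markers
            q.append(x)
            q.sort()
            return
        # Locate the cell k with q[k] <= x < q[k+1], widening the extremes if needed.
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:
            k = max(i for i in range(4) if q[i] <= x)
        for i in range(k + 1, 5):      # markers to the right of x shift by one
            n[i] += 1
        for i in range(5):
            self.want[i] += self.step[i]
        # Move each interior marker by at most ONE position toward its desired spot.
        for i in (1, 2, 3):
            d = self.want[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                d = 1 if d > 0 else -1
                # Piecewise-parabolic prediction of the new marker height.
                cand = q[i] + d / (n[i + 1] - n[i - 1]) * (
                    (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
                    + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
                )
                if not q[i - 1] < cand < q[i + 1]:
                    # Fall back to linear interpolation toward the neighbor.
                    cand = q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])
                q[i] = cand
                n[i] += d

    def value(self):
        if len(self.q) < 5:
            raise ValueError("need at least five observations")
        return self.q[2]               # the middle marker tracks the p-quantile


random.seed(0)
median = P2Quantile(0.5)
for _ in range(10_000):
    median.add(random.random())
print(median.value())                  # an estimate of the median of Uniform(0, 1)
```

Only five markers are stored regardless of the number of observations, which is why the one-pass approach does not need the whole data set in memory.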

You may obtain inaccurate results if you use the one-pass approach to estimate quantiles beyond the quartiles (that is, quantiles below P25 or above P75). A large sample size (10,000 or more) is often required if tail quantiles (at or below P10, or at or above P90) are requested. Note that tail quantile estimates may be inaccurate for variables with highly skewed or heavy-tailed distributions.

The order-statistics approach for estimating quantiles is faster than the one-pass method but requires that the entire data set be stored in memory. The accuracy in estimating the quantiles is comparable for both methods when the requested percentiles are between the lower and upper quartiles. The default is PCTLMTD=ORD_STAT if enough memory is available; otherwise, PCTLMTD=ONEPASS.

Computational Methods for the PCTLDEF= Option

You can specify one of five methods for computing quantile statistics when you use the order-statistics approach (PCTLMTD=ORD_STAT). When you use the one-pass approach (PCTLMTD=ONEPASS), the PCTLDEF=5 method is always used.

Let $n$ be the number of nonmissing values for a variable, and let $x_1, x_2, \ldots, x_n$ represent the ordered values of the variable. For the $t$th percentile, let $p = t/100$. In the following definitions numbered 1, 2, 3, and 5, let

$np = j + g$

where $j$ is the integer part and $g$ is the fractional part of $np$. For definition 4, let

$(n+1)p = j + g$
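The decomposition into an integer part j and a fractional part g can be sketched in Python as follows. The helper name `split_position` and its `use_n_plus_1` flag are hypothetical. Exact rational arithmetic is used here as a precaution, so that a position that should be a whole number is not turned into a slightly different value by binary floating-point rounding; this is an illustrative choice, not a statement about how SAS computes it.

```python
from fractions import Fraction
from math import floor


def split_position(n, t, use_n_plus_1=False):
    """Return (j, g): the integer and fractional parts of n*p, or of (n+1)*p."""
    p = Fraction(t) / 100                        # p = t/100, kept exact
    position = (n + 1) * p if use_n_plus_1 else n * p
    j = floor(position)                          # integer part
    g = position - j                             # fractional part, 0 <= g < 1
    return j, g


# n = 10 values, 25th percentile
print(split_position(10, 25))                    # np = 2.5      -> j = 2, g = 1/2
print(split_position(10, 25, use_n_plus_1=True)) # (n+1)p = 2.75 -> j = 2, g = 3/4
```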

Given the preceding definitions, the tth percentile, y, is defined as follows:

PCTLDEF=1

weighted average at $x_{np}$

$y = (1 - g)\,x_j + g\,x_{j+1}$

where $x_0$ is taken to be $x_1$
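As a rough illustration of definition 1, the following Python sketch computes the weighted average from a list of values. The function name `percentile_def1` is hypothetical and this is not SAS code. The lower clamp implements the stated convention for the value at index 0; the upper clamp is only a guard added here, since the value at index n+1 is reached only when g = 0 and then carries zero weight.

```python
from fractions import Fraction
from math import floor


def percentile_def1(values, t):
    """Definition 1: weighted average of x_j and x_{j+1}, where np = j + g."""
    x = sorted(values)
    n = len(x)
    position = n * Fraction(t) / 100
    j = floor(position)
    g = float(position - j)

    def x_at(i):
        # 1-based lookup; index 0 maps to x_1 as the definition states.
        # Clamping at n is a guard added in this sketch (weight is 0 there).
        return x[min(max(i, 1), n) - 1]

    return (1 - g) * x_at(j) + g * x_at(j + 1)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile_def1(data, 25))   # np = 2.5: halfway between x_2 and x_3 -> 2.5
print(percentile_def1(data, 5))    # np = 0.5: x_0 is taken to be x_1     -> 1.0
```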

PCTLDEF=2

observation numbered closest to $np$

$y = x_i$

where $i$ is the integer part of $np + 1/2$ if $g \neq 1/2$. If $g = 1/2$, then $y = x_j$ if $j$ is even, or $y = x_{j+1}$ if $j$ is odd
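Definition 2 amounts to rounding np to the nearest observation number, with ties (g = 1/2) going to the even-numbered observation. A minimal Python sketch follows; the function name `percentile_def2` is hypothetical, and clamping the index to at least 1 for very small percentiles is an assumption of this sketch, since the definition does not say what happens when the rounded index is 0.

```python
from fractions import Fraction
from math import floor


def percentile_def2(values, t):
    """Definition 2: the observation numbered closest to np."""
    x = sorted(values)
    n = len(x)
    position = n * Fraction(t) / 100
    j = floor(position)
    g = position - j
    if g == Fraction(1, 2):
        i = j if j % 2 == 0 else j + 1      # tie: pick the even observation number
    else:
        i = floor(position + Fraction(1, 2))
    i = min(max(i, 1), n)                   # assumption: keep the index within 1..n
    return x[i - 1]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile_def2(data, 25))   # np = 2.5, j = 2 is even -> x_2 = 2
print(percentile_def2(data, 75))   # np = 7.5, j = 7 is odd  -> x_8 = 8
print(percentile_def2(data, 38))   # np = 3.8, nearest is 4  -> x_4 = 4
```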

PCTLDEF=3

empirical distribution function

$y = x_j$ if $g = 0$

$y = x_{j+1}$ if $g > 0$
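Definition 3 always returns an actual observation: the one at position np when np is a whole number, and the next one otherwise. A minimal Python sketch, with a hypothetical function name `percentile_def3`; mapping index 0 to the first observation (the t = 0 case) is an assumption of this sketch, not part of the definition.

```python
from fractions import Fraction
from math import floor


def percentile_def3(values, t):
    """Definition 3: empirical distribution function."""
    x = sorted(values)
    n = len(x)
    position = n * Fraction(t) / 100
    j = floor(position)
    g = position - j
    i = j if g == 0 else j + 1
    i = max(i, 1)                  # assumption: index 0 (t = 0) maps to x_1
    return x[i - 1]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile_def3(data, 30))   # np = 3.0, g = 0 -> x_3 = 3
print(percentile_def3(data, 31))   # np = 3.1, g > 0 -> x_4 = 4
```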

PCTLDEF=4

weighted average aimed at $x_{(n+1)p}$

$y = (1 - g)\,x_j + g\,x_{j+1}$

where $x_{n+1}$ is taken to be $x_n$
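Definition 4 uses the same weighted average as definition 1, but splits (n+1)p rather than np. A minimal Python sketch, with a hypothetical function name `percentile_def4`: the upper clamp implements the stated convention for the value at index n+1, while the lower clamp (index 0 mapped to the first observation, for very small percentiles) is an assumption of this sketch.

```python
from fractions import Fraction
from math import floor


def percentile_def4(values, t):
    """Definition 4: weighted average of x_j and x_{j+1}, where (n+1)p = j + g."""
    x = sorted(values)
    n = len(x)
    position = (n + 1) * Fraction(t) / 100
    j = floor(position)
    g = float(position - j)

    def x_at(i):
        # 1-based lookup; index n+1 maps to x_n as the definition states.
        # Mapping index 0 to x_1 is an assumption of this sketch.
        return x[min(max(i, 1), n) - 1]

    return (1 - g) * x_at(j) + g * x_at(j + 1)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile_def4(data, 25))   # (n+1)p = 2.75: 0.25*x_2 + 0.75*x_3 -> 2.75
print(percentile_def4(data, 50))   # (n+1)p = 5.5:  halfway x_5 and x_6 -> 5.5
```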

PCTLDEF=5

empirical distribution function with averaging

$y = (x_j + x_{j+1})/2$ if $g = 0$

$y = x_{j+1}$ if $g > 0$
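Definition 5 behaves like definition 3 except when np is a whole number, in which case it averages the two neighboring observations. A minimal Python sketch, with a hypothetical function name `percentile_def5`; clamping the index to the range 1..n (which matters only for t = 0 and t = 100) is an assumption of this sketch, not part of the definition.

```python
from fractions import Fraction
from math import floor


def percentile_def5(values, t):
    """Definition 5: empirical distribution function with averaging."""
    x = sorted(values)
    n = len(x)
    position = n * Fraction(t) / 100
    j = floor(position)
    g = position - j

    def x_at(i):
        # 1-based lookup; clamping to 1..n is an assumption of this sketch.
        return x[min(max(i, 1), n) - 1]

    if g == 0:
        return (x_at(j) + x_at(j + 1)) / 2
    return x_at(j + 1)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile_def5(data, 50))   # np = 5.0, g = 0 -> (x_5 + x_6)/2 = 5.5
print(percentile_def5(data, 25))   # np = 2.5, g > 0 -> x_3 = 3
```

With an even number of values, the 50th percentile under this definition is the average of the two middle values, which matches the usual definition of the sample median.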
