Statistics Based on the Empirical Distribution Function

The NPAR1WAY Procedure

If you specify the EDF option, PROC NPAR1WAY computes statistics based on the empirical distribution function. These include Kolmogorov-Smirnov and Cramer-von Mises statistics, and also Kuiper statistics for two-sample data. This section gives formulas for these statistics. For further information on the formulas and the interpretation of EDF statistics, refer to Hollander and Wolfe (1973) and Gibbons and Chakraborti (1992). For details on the k-sample analogues of the Kolmogorov-Smirnov and Cramer-von Mises statistics used by NPAR1WAY, refer to Kiefer (1959).

The empirical distribution function (EDF) of a sample {x_j}, j = 1,2, ... ,n, is defined as the following function:

$F(x) = \frac{1}{n} (\text{number of } x_j \leq x) = \frac{1}{n} \sum_{j=1}^n I(x_j \leq x)$

where I(·) is the indicator function. PROC NPAR1WAY uses the subsample of values within the ith class level to generate an EDF, F_i. The EDF for the pooled sample can also be expressed as

$F = \frac{1}{n} \sum_i n_i F_i$

where n_i is the number of observations in the ith class level, and n is the total number of observations.
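The two definitions above can be checked numerically. The following is a minimal Python sketch, not the procedure's implementation; the helper name `edf`, the class labels, and the data values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def edf(sample):
    """Return F where F(x) = (number of sample values <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


# Hypothetical data: class level -> observed values.
classes = {"a": [1.0, 2.0, 4.0], "b": [2.0, 3.0, 5.0, 6.0]}
pooled = [x for values in classes.values() for x in values]
n = len(pooled)

F = edf(pooled)                                        # pooled EDF
F_i = {level: edf(v) for level, v in classes.items()}  # per-class EDFs


def pooled_from_classes(x):
    """Pooled EDF rebuilt as the n_i-weighted average of the class EDFs."""
    return sum(len(v) * F_i[level](x) for level, v in classes.items()) / n


# Both routes to the pooled EDF agree at every observed value.
for x in sorted(set(pooled)):
    print(x, F(x), pooled_from_classes(x))
```

Evaluating the EDF with `bisect_right` on the sorted sample counts the values less than or equal to x, which matches the indicator-sum definition, including at tied values.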

Kolmogorov-Smirnov Statistics

The Kolmogorov-Smirnov statistic measures the maximum deviation of the EDF within the classes from the pooled EDF. PROC NPAR1WAY computes the Kolmogorov-Smirnov statistic as

$\mathrm{KS} = \max_j \sqrt{ \frac{1}{n} \sum_i n_i \left( F_i(x_j) - F(x_j) \right)^2 } \quad \text{where } j = 1, 2, \ldots, n$

The asymptotic Kolmogorov-Smirnov statistic is computed as

$\mathrm{KS}_a = \mathrm{KS} \sqrt{n}$

In addition to the overall Kolmogorov-Smirnov statistic and the asymptotic statistic, PROC NPAR1WAY displays the values of the F_i at the maximum deviation from F, the values $(F_i-F)\sqrt{n_i}$ at the maximum deviation from F, the value of F at the maximum deviation, and the point where this maximum deviation occurs.
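As an illustration of the KS and KS_a formulas, here is a small Python sketch that evaluates the deviation at every pooled observation and keeps the largest. The function name `ks_k_sample` and the data are hypothetical, and the code follows only the formulas given in this section, so it is not a reproduction of the procedure's output.

```python
from bisect import bisect_right
from math import sqrt


def edf_at(sorted_sample, x):
    """EDF of an already sorted sample, evaluated at x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_k_sample(classes):
    """Return (KS, KS_a, x) where x is the point of maximum deviation."""
    groups = [sorted(v) for v in classes.values()]
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    best_dev, best_x = -1.0, None
    for x in pooled:
        F = edf_at(pooled, x)
        # Weighted root-mean-square deviation of the class EDFs from F.
        dev = sqrt(sum(len(g) * (edf_at(g, x) - F) ** 2 for g in groups) / n)
        if dev > best_dev:
            best_dev, best_x = dev, x
    return best_dev, best_dev * sqrt(n), best_x


# Hypothetical data with two class levels.
ks, ks_a, x_max = ks_k_sample({"a": [1.0, 2.0, 4.0], "b": [2.0, 3.0, 5.0, 6.0]})
print(ks, ks_a, x_max)
```

For this data the maximum is attained at x = 4, where F_a = 1, F_b = 1/2, and F = 5/7.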

If there are only two class levels, PROC NPAR1WAY computes the two-sample Kolmogorov statistic as

$D = \max_j \left| F_1(x_j) - F_2(x_j) \right| \quad \text{where } j = 1, 2, \ldots, n$

PROC NPAR1WAY also computes the asymptotic probability of observing a larger test statistic. The quality of this approximation has been studied by Hodges (1957). For tables of the exact distribution of D when the two sample sizes are equal, refer to Lehmann (1975, p. 413). For tables of the exact distribution for unequal sample sizes, refer to Kim and Jennrich (1970, pp. 79-170).
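The two-sample statistic D can be sketched in Python as follows. This section does not state which formula the procedure uses for the asymptotic probability, so `kolmogorov_tail` below is an assumption: it implements the classical Kolmogorov limiting series applied to D scaled by the square root of n_1 n_2 / n. The function names and data are hypothetical.

```python
from bisect import bisect_right
from math import exp, sqrt


def two_sample_d(sample1, sample2):
    """Largest absolute difference between the two EDFs over all observations."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(s1), len(s2)
    return max(
        abs(bisect_right(s1, x) / n1 - bisect_right(s2, x) / n2) for x in s1 + s2
    )


def kolmogorov_tail(z, tol=1e-12, max_terms=10000):
    """Assumed limiting form: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 z^2)."""
    if z <= 0:
        return 1.0
    total = 0.0
    for k in range(1, max_terms + 1):
        term = exp(-2.0 * k * k * z * z)
        total += term if k % 2 else -term
        if term < tol:
            break
    # Clamp to guard against rounding slightly outside [0, 1].
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical data.
x1, x2 = [1.0, 2.0, 4.0], [2.0, 3.0, 5.0, 6.0]
d = two_sample_d(x1, x2)
n1, n2 = len(x1), len(x2)
z = d * sqrt(n1 * n2 / (n1 + n2))  # scaled statistic (assumed scaling)
p = kolmogorov_tail(z)
print(d, z, p)
```

With two class levels, substituting F = (n_1 F_1 + n_2 F_2)/n into the KS formula gives KS = D sqrt(n_1 n_2)/n, so the scaled value z used here coincides with KS_a.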

Cramer-von Mises Statistics

The Cramer-von Mises statistic is defined as

$\mathrm{CM} = \frac{1}{n^2} \sum_i \left( n_i \sum_{j=1}^p t_j \left( F_i(x_j) - F(x_j) \right)^2 \right)$

where t_j is the number of ties at the jth distinct value and p is the number of distinct values. CM measures the integrated deviation of the EDFs within the classes from the pooled EDF. PROC NPAR1WAY displays the contribution of each class to the sum together with the sum itself, which is the asymptotic value formed by multiplying the Cramer-von Mises statistic by the number of observations.
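The CM formula can be illustrated with the Python sketch below. The function name `cramer_von_mises` and the data are hypothetical. The per-class values returned here are each class's term of CM as defined above; whether the procedure reports its contributions on this scale or on the scale multiplied by n is not assumed.

```python
from bisect import bisect_right
from collections import Counter


def cramer_von_mises(classes):
    """Return (CM, per-class terms of CM) following the formula in this section."""
    groups = {level: sorted(v) for level, v in classes.items()}
    pooled = sorted(x for g in groups.values() for x in g)
    n = len(pooled)
    ties = Counter(pooled)  # distinct value x_j -> tie count t_j
    contributions = {}
    for level, g in groups.items():
        n_i = len(g)
        inner = sum(
            t_j * (bisect_right(g, x_j) / n_i - bisect_right(pooled, x_j) / n) ** 2
            for x_j, t_j in ties.items()
        )
        contributions[level] = n_i * inner / n**2
    return sum(contributions.values()), contributions


# Hypothetical data; the value 2.0 appears twice, so its t_j is 2.
cm, parts = cramer_von_mises({"a": [1.0, 2.0, 4.0], "b": [2.0, 3.0, 5.0, 6.0]})
n_total = 7
print(cm, cm * n_total, parts)
```

Summing over distinct values weighted by t_j gives the same result as summing over all n observations, which is why the tie counts appear in the formula.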

Kuiper Statistics

For data with two class levels, PROC NPAR1WAY computes the Kuiper statistic, its scaled value for the asymptotic distribution, and the asymptotic p-value. The Kuiper statistic is computed as

$K = \max_j \left( F_1(x_j) - F_2(x_j) \right) - \min_j \left( F_1(x_j) - F_2(x_j) \right) \quad \text{where } j = 1, 2, \ldots, n$

The asymptotic value is

$K_a = K \sqrt{ \frac{n_1 n_2}{n} }$

The p-value is the probability of observing a larger value of K_a under the null hypothesis of no difference between the two classes.
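A short Python sketch of K and K_a, following the two formulas above, is given here for illustration. The function name `kuiper` and the data are hypothetical, and the asymptotic p-value is not computed because this section does not give its formula.

```python
from bisect import bisect_right
from math import sqrt


def kuiper(sample1, sample2):
    """Return (K, K_a) for two samples."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(s1), len(s2)
    # Signed EDF differences F_1(x_j) - F_2(x_j) at every observation.
    diffs = [bisect_right(s1, x) / n1 - bisect_right(s2, x) / n2 for x in s1 + s2]
    k = max(diffs) - min(diffs)
    return k, k * sqrt(n1 * n2 / (n1 + n2))


# Hypothetical data in which F_1 - F_2 takes both signs.
k, k_a = kuiper([1.0, 2.0, 6.0], [3.0, 4.0, 5.0, 7.0])
print(k, k_a)
```

In this example the largest difference is 2/3 (at x = 2) and the smallest is -1/12 (at x = 5), so K = 3/4. Because K adds the largest positive and largest negative deviations, it is never smaller than the two-sample statistic D.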
