HISTOGRAM Statement

Printed Output

If you request a fitted parametric distribution, printed output summarizing the fit is produced in addition to the graphical display. Figure 4.9 shows the printed output for a fitted lognormal distribution requested by the following statements:
   proc capability;
      spec target=14 lsl=13.95 usl=14.05;
      histogram / lognormal(indices midpercents);
   run;
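The summary is organized into the following parts: Parameters, Chi-Square Goodness-of-Fit Test, EDF Goodness-of-Fit Tests, Specifications, Indices Using Fitted Curves, Histogram Intervals, and Quantiles. These parts are described in the sections that follow.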

Parameters

This section lists the parameters for the fitted curve as well as the estimated mean and estimated standard deviation. See "Formulas for Fitted Curves".

 
The CAPABILITY Procedure
Fitted Lognormal Distribution for width

Parameters for Lognormal Distribution
Parameter    Symbol    Estimate
Threshold    Theta     0
Scale        Zeta      2.638966
Shape        Sigma     0.001497
Mean                   13.99873
Std Dev                0.020952

Goodness-of-Fit Tests for Lognormal Distribution
Test                  Statistic             DF    p Value
Kolmogorov-Smirnov    D        0.09148348         Pr > D         >0.150
Cramer-von Mises      W-Sq     0.05040427         Pr > W-Sq      >0.500
Anderson-Darling      A-Sq     0.33476355         Pr > A-Sq      >0.500
Chi-Square            Chi-Sq   2.87938822   3     Pr > Chi-Sq     0.411

Capability Indices Based on Lognormal Distribution
Cp     0.795463
CPL    0.776822
CPU    0.814021
Cpk    0.776822
Cpm    0.792237

Histogram Bin Percents for Lognormal Distribution
Bin Midpoint    Observed Percent    Estimated Percent
13.95            4.000               2.963
13.97           18.000              15.354
13.99           26.000              33.872
14.01           38.000              32.055
14.03           10.000              13.050
14.05            4.000               2.281

Quantiles for Lognormal Distribution
Percent    Observed Quantile    Estimated Quantile
 1.0       13.9440              13.9501
 5.0       13.9656              13.9643
10.0       13.9710              13.9719
25.0       13.9860              13.9846
50.0       14.0018              13.9987
75.0       14.0129              14.0129
90.0       14.0218              14.0256
95.0       14.0241              14.0332
99.0       14.0470              14.0475
Figure 4.9: Sample Summary of Fitted Distribution

Chi-Square Goodness-of-Fit Test

The chi-square goodness-of-fit statistic for a fitted parametric distribution is computed as follows:

\chi^2 = \sum_{i=1}^m \frac{ ( O_{i} -E_{i} )^2 }{E_{i}}

where

		 O_i = observed percentage in the ith histogram interval
		 E_i = expected percentage in the ith histogram interval
		 m = number of histogram intervals
		 p = number of estimated parameters
The degrees of freedom for the chi-square test are equal to m - p - 1. You can save the observed and expected interval percentages in the OUTFIT= data set discussed in "Output Data Sets".

Note that empty intervals are not combined, and the range of intervals used to compute \chi^2 begins with the first interval containing observations and ends with the final interval containing observations.
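
As a rough check of the formula, the following Python sketch applies it to the bin percentages shown in Figure 4.9. The sample size n = 50 is an assumption: it is not stated in the output and is inferred only from the observed percentages all being multiples of 2. The function and variable names are illustrative and are not part of the CAPABILITY procedure.

```python
# Observed and estimated bin percentages copied from Figure 4.9.
observed = [4.000, 18.000, 26.000, 38.000, 10.000, 4.000]
estimated = [2.963, 15.354, 33.872, 32.055, 13.050, 2.281]


def chi_square(obs, exp):
    """Sum of (O_i - E_i)^2 / E_i over the histogram intervals."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


m = len(observed)  # number of histogram intervals
p = 2              # estimated parameters (zeta and sigma; theta is fixed at 0)
df = m - p - 1     # degrees of freedom

# Statistic computed directly on the percentage scale.
chi2_percent = chi_square(observed, estimated)

# ASSUMPTION: n = 50 observations, so one percentage point is 0.5 of a count.
n = 50
chi2_count = chi_square([o * n / 100 for o in observed],
                        [e * n / 100 for e in estimated])

print(df, round(chi2_percent, 3), round(chi2_count, 3))
```

On the percentage scale the sum is about 5.76. Rescaled to counts under the assumed n = 50 it is about 2.88, which is close to the printed Chi-Sq value of 2.879, and m - p - 1 = 3 agrees with the printed DF.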

EDF Goodness-of-Fit Tests

When you fit a parametric distribution, the HISTOGRAM statement provides a series of goodness-of-fit tests based on the empirical distribution function (EDF). The EDF tests offer advantages over the chi-square goodness-of-fit test, including improved power and invariance with respect to the histogram midpoints. For a thorough discussion, refer to D'Agostino and Stephens (1986).

The empirical distribution function is defined for a set of n independent observations X_1, ..., X_n with a common distribution function F(x). Denote the observations ordered from smallest to largest as X_{(1)}, ..., X_{(n)}. The empirical distribution function, F_n(x), is defined as

F_n(x) = \begin{cases}
 0, & x \lt X_{(1)} \\
 \frac{i}{n}, & X_{(i)} \leq x \lt X_{(i+1)}, \quad i = 1, \ldots, n-1 \\
 1, & X_{(n)} \leq x
\end{cases}

Note that F_n(x) is a step function that takes a step of height 1/n at each observation. This function estimates the distribution function F(x). At any value x, F_n(x) is the proportion of observations less than or equal to x, while F(x) is the probability of an observation less than or equal to x. EDF statistics measure the discrepancy between F_n(x) and F(x).
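
The definition above can be sketched in a few lines of Python. This is only an illustration of the step function, not the procedure's implementation; the helper name make_edf and the five sample values are hypothetical.

```python
from bisect import bisect_right


def make_edf(observations):
    """Return F_n, the empirical distribution function of the observations."""
    ordered = sorted(observations)
    n = len(ordered)

    def edf(x):
        # Count of observations less than or equal to x, divided by n.
        return bisect_right(ordered, x) / n

    return edf


# Hypothetical sample of five measurements (illustrative values only).
sample = [14.01, 13.97, 13.99, 14.03, 13.99]
F_n = make_edf(sample)

for x in (13.96, 13.97, 13.99, 14.02, 14.03):
    print(x, F_n(x))
```

Because 13.99 occurs twice in this sample, the function rises by 2/n at that point, that is, by one step of 1/n per tied observation.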

The computational formulas for the EDF statistics make use of the probability integral transformation U = F(X). If F(x) is the continuous distribution function of X, the random variable U is uniformly distributed between 0 and 1.

Given n ordered observations X_{(1)}, ..., X_{(n)}, the values U_{(i)} = F(X_{(i)}) are computed by applying the transformation, as shown in the following sections.
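
A minimal Python sketch of the transformation follows, using the lognormal parameters printed in Figure 4.9. It assumes the usual three-parameter lognormal form, in which log(X - theta) is normal with mean zeta and standard deviation sigma. The sample values and the helper name lognormal_cdf are hypothetical.

```python
from math import exp, log
from statistics import NormalDist


def lognormal_cdf(x, theta, zeta, sigma):
    """F(x) for a lognormal with threshold theta, scale zeta, shape sigma."""
    if x <= theta:
        return 0.0
    return NormalDist().cdf((log(x - theta) - zeta) / sigma)


# Parameter estimates copied from Figure 4.9.
theta, zeta, sigma = 0.0, 2.638966, 0.001497

# Hypothetical observations (illustrative values only).
sample = [14.01, 13.97, 13.99, 14.03, 13.98]

# U_(i) = F(X_(i)): transform the ordered observations.
u_sorted = [lognormal_cdf(x, theta, zeta, sigma) for x in sorted(sample)]

# The fitted median exp(zeta) should map to 0.5.
median_u = lognormal_cdf(theta + exp(zeta), theta, zeta, sigma)
print(u_sorted, median_u)
```

If the fitted distribution describes the data well, the values in u_sorted should look like an ordered sample from the uniform distribution on (0, 1).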

The HISTOGRAM statement provides three EDF tests: Kolmogorov-Smirnov, Anderson-Darling, and Cramér-von Mises.

These tests are based on various measures of the discrepancy between the empirical distribution function Fn(x) and the proposed parametric cumulative distribution function F(x).

The following sections provide formal definitions of the EDF statistics.

Kolmogorov-Smirnov Statistic

The Kolmogorov-Smirnov statistic (D) is defined as
D = \sup_x | F_n(x) - F(x) |
The Kolmogorov-Smirnov statistic belongs to the supremum class of EDF statistics. This class of statistics is based on the largest vertical difference between F(x) and F_n(x).

The Kolmogorov-Smirnov statistic is computed as the maximum of D^{+} and D^{-}, where D^{+} is the largest vertical distance between the EDF and the distribution function when the EDF is greater than the distribution function, and D^{-} is the largest vertical distance when the EDF is less than the distribution function.

D^{+} = \max_{i} \left( \frac{i}{n} - U_{(i)} \right)
D^{-} = \max_{i} \left( U_{(i)} - \frac{i-1}{n} \right)
D = \max(D^{+}, D^{-})
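
The three formulas above translate directly into code. The following Python sketch assumes the ordered values U_(1), ..., U_(n) are already available; the four values used here are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kolmogorov_smirnov(u_sorted):
    """Return (D+, D-, D) for ordered values U_(1) <= ... <= U_(n)."""
    n = len(u_sorted)
    # D+: largest amount by which the EDF lies above F at an observation.
    d_plus = max(i / n - u for i, u in enumerate(u_sorted, start=1))
    # D-: largest amount by which F lies above the EDF just before a step.
    d_minus = max(u - (i - 1) / n for i, u in enumerate(u_sorted, start=1))
    return d_plus, d_minus, max(d_plus, d_minus)


# Hypothetical transformed values (illustrative only).
u = [0.10, 0.35, 0.40, 0.80]
d_plus, d_minus, d = kolmogorov_smirnov(u)
print(d_plus, d_minus, d)
```

For these values, D+ is attained at i = 3 (3/4 - 0.40 = 0.35) and D- at i = 1 (0.10 - 0 = 0.10), so D is approximately 0.35.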

Anderson-Darling Statistic

The Anderson-Darling statistic and the Cramér-von Mises statistic belong to the quadratic class of EDF statistics. This class of statistics is based on the squared difference (F_n(x) - F(x))^2. Quadratic statistics have the following general form:
Q = n \int_{-\infty}^{+\infty} (F_n(x)-F(x))^2 \psi(x) dF(x)
The function \psi(x) weights the squared difference (F_n(x) - F(x))^2.

The Anderson-Darling statistic (A^2) is defined as

A^2 = n \int_{-\infty}^{+\infty} (F_n(x)-F(x))^2 [F(x)(1-F(x))]^{-1} dF(x)
Here the weight function is \psi(x) = [F(x)(1-F(x))]^{-1}.

The Anderson-Darling statistic is computed as

A^2 = -n - \frac{1}{n} \sum_{i=1}^n \left[ (2i-1) \log U_{(i)} + (2n+1-2i) \log(1-U_{(i)}) \right]
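
The computational formula can be sketched in Python as follows, again taking the ordered U_(i) values as given. The input values are hypothetical, and every U_(i) must lie strictly between 0 and 1 for the logarithms to be defined.

```python
from math import log


def anderson_darling(u_sorted):
    """A^2 from ordered values U_(1) <= ... <= U_(n), each in (0, 1)."""
    n = len(u_sorted)
    total = sum(
        (2 * i - 1) * log(u) + (2 * n + 1 - 2 * i) * log(1 - u)
        for i, u in enumerate(u_sorted, start=1)
    )
    return -n - total / n


# Hypothetical transformed values (illustrative only).
u = [0.10, 0.35, 0.40, 0.80]
a2 = anderson_darling(u)
print(a2)
```

Because the weight [F(x)(1-F(x))]^{-1} grows without bound as F(x) approaches 0 or 1, values of U_(i) near either end of the interval contribute heavily to the sum, which is why this statistic is sensitive to discrepancies in the tails.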

Cramér-von Mises Statistic

The Cramér-von Mises statistic (W^2) is defined as
W^2 = n \int_{-\infty}^{+\infty} (F_n(x)-F(x))^2 dF(x)
Here the weight function is \psi(x) = 1.

The Cramér-von Mises statistic is computed as

W^2 = \sum_{i=1}^n \left( U_{(i)} - \frac{2i-1}{2n} \right)^2 + \frac{1}{12n}
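
A short Python sketch of the computational formula is shown below, with the same hypothetical U_(i) values used purely for illustration.

```python
def cramer_von_mises(u_sorted):
    """W^2 from ordered values U_(1) <= ... <= U_(n)."""
    n = len(u_sorted)
    squared = sum(
        (u - (2 * i - 1) / (2 * n)) ** 2
        for i, u in enumerate(u_sorted, start=1)
    )
    return squared + 1 / (12 * n)


# Hypothetical transformed values (illustrative only).
u = [0.10, 0.35, 0.40, 0.80]
w2 = cramer_von_mises(u)

# Values sitting exactly at (2i - 1) / (2n) give the minimum, 1 / (12n).
w2_min = cramer_von_mises([0.125, 0.375, 0.625, 0.875])
print(w2, w2_min)
```

Each U_(i) is compared with the midpoint (2i-1)/(2n) of the ith of n equal subintervals of (0, 1). Here the squared deviations sum to 0.0575, and adding 1/48 gives a W^2 of roughly 0.0783.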

Probability Values for EDF Tests

Once the EDF test statistics are computed, the associated probability values (p-values) must be calculated. The CAPABILITY procedure uses internal tables of probability levels similar to those given by D'Agostino and Stephens (1986). If the value is between two probability levels, then linear interpolation is used to estimate the probability value.

The probability value depends upon the parameters that are known and the parameters that are estimated for the distribution you are fitting. Table 4.17 summarizes different combinations of estimated parameters for which EDF tests are available.

Note: The threshold (THETA=) parameter for the beta, exponential, gamma, lognormal, and Weibull distributions is assumed to be known. If you do not specify its value, it is assumed to be zero and known. Likewise, the SIGMA= parameter, which determines the upper threshold (SIGMA) for the beta distribution, is assumed to be known; if you do not specify its value, it is assumed to be one. These parameters are not listed in Table 4.17 because they are assumed to be known in all cases, and they do not affect which EDF statistics are computed.

Table 4.17: Availability of EDF Tests
Distribution   Parameters                       EDF Tests Available
Beta           \alpha and \beta unknown         none
               \alpha known, \beta unknown      none
               \alpha unknown, \beta known      none
               \alpha and \beta known           all
Exponential    \sigma unknown                   all
               \sigma known                     all
Gamma          \alpha and \sigma unknown        none
               \alpha known, \sigma unknown     none
               \alpha unknown, \sigma known     none
               \alpha and \sigma known          all
Lognormal      \zeta and \sigma unknown         all
               \zeta known, \sigma unknown      A^2 and W^2
               \zeta unknown, \sigma known      A^2 and W^2
               \zeta and \sigma known           all
Normal         \mu and \sigma unknown           all
               \mu known, \sigma unknown        A^2 and W^2
               \mu unknown, \sigma known        A^2 and W^2
               \mu and \sigma known             all
Weibull        c and \sigma unknown             A^2 and W^2
               c known, \sigma unknown          A^2 and W^2
               c unknown, \sigma known          A^2 and W^2
               c and \sigma known               all

Specifications

This section is included in the summary only if you provide specification limits, and it tabulates the limits as well as the observed percentages and estimated percentages outside the limits.

The estimated percentages are computed only if fitted distributions are requested. They are based on the probability, under the fitted distribution, that an observation falls outside the specification limits. The observed percentages are the percentages of observations outside the specification limits.
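
To illustrate how the estimated percentages follow from a fitted curve, this Python sketch evaluates the lognormal fit from Figure 4.9 at the specification limits used in the example statements (LSL = 13.95, USL = 14.05). It assumes the usual three-parameter lognormal form; the helper name is hypothetical, and Figure 4.9 does not print these percentages, so there is nothing in the source to compare the results against.

```python
from math import log
from statistics import NormalDist


def lognormal_cdf(x, theta, zeta, sigma):
    """F(x) for a lognormal with threshold theta, scale zeta, shape sigma."""
    if x <= theta:
        return 0.0
    return NormalDist().cdf((log(x - theta) - zeta) / sigma)


# Parameter estimates copied from Figure 4.9.
theta, zeta, sigma = 0.0, 2.638966, 0.001497
lsl, usl = 13.95, 14.05

# Estimated percentage below LSL: 100 * F(LSL).
pct_below_lsl = 100 * lognormal_cdf(lsl, theta, zeta, sigma)
# Estimated percentage above USL: 100 * (1 - F(USL)).
pct_above_usl = 100 * (1 - lognormal_cdf(usl, theta, zeta, sigma))

print(round(pct_below_lsl, 2), round(pct_above_usl, 2))
```

With these parameters the two tails come out at about 1.0 percent below the lower limit and about 0.7 percent above the upper limit.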

Indices Using Fitted Curves

This section is included in the summary only if you specify the INDICES option in parentheses after a distribution option, as in the statements that produce Figure 4.9. Standard process capability indices, such as Cp and Cpk, are not appropriate if the data are not normally distributed. The INDICES option computes generalizations of the standard indices using the fact that for the normal distribution, 3\sigma is both the distance from the lower 0.135 percentile to the median (or mean) and the distance from the median (or mean) to the upper 99.865 percentile. These percentiles are estimated from the fitted distribution, and the appropriate percentile-to-median distances are substituted for 3\sigma in the standard formulas.

Writing T for the target, LSL and USL for the lower and upper specification limits, and P_{\alpha} for the 100\alpha th percentile, the generalized capability indices are as follows:

C_{pl} = \frac{P_{0.5} - LSL}{P_{0.5} - P_{0.00135}}

C_{pu} = \frac{USL - P_{0.5}}{P_{0.99865} - P_{0.5}}

C_p = \frac{USL - LSL}{P_{0.99865} - P_{0.00135}}

C_{pk} = \min\left( \frac{P_{0.5} - LSL}{P_{0.5} - P_{0.00135}}, \frac{USL - P_{0.5}}{P_{0.99865} - P_{0.5}} \right)

K = 2 \times \frac{\left| \frac{1}{2}(USL + LSL) - P_{0.5} \right|}{USL - LSL}

C_{pm} = \frac{\min\left( \frac{T - LSL}{P_{0.5} - P_{0.00135}}, \frac{USL - T}{P_{0.99865} - P_{0.5}} \right)}{\sqrt{1 + \left( \frac{\mu - T}{\sigma} \right)^2}}

If the data are normally distributed, these formulas reduce to the formulas for the standard capability indices, which are given in "Standard Capability Indices". The following guidelines apply to the use of generalized capability indices requested with the INDICES option:

At the time of this writing, there is ongoing research concerning the application of generalized capability indices, and it is important to note that other approaches can be used with nonnormal data:


Refer to Rodriguez (1992) for additional discussion.
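
The generalized index formulas in this section can be checked numerically against Figure 4.9. The Python sketch below assumes the usual three-parameter lognormal form for the percentiles and takes the mean and standard deviation in the Cpm formula from the Mean and Std Dev rows of the printed output. The helper name is hypothetical, and small differences from the printed indices are expected because the printed parameter estimates are rounded.

```python
from math import exp, sqrt
from statistics import NormalDist


def lognormal_quantile(alpha, theta, zeta, sigma):
    """P_alpha: the 100*alpha-th percentile of the fitted lognormal."""
    return theta + exp(zeta + sigma * NormalDist().inv_cdf(alpha))


# Estimates copied from Figure 4.9; limits and target from the SPEC statement.
theta, zeta, sigma = 0.0, 2.638966, 0.001497
mean, std_dev = 13.99873, 0.020952
lsl, target, usl = 13.95, 14.0, 14.05

p_low = lognormal_quantile(0.00135, theta, zeta, sigma)
p_med = lognormal_quantile(0.5, theta, zeta, sigma)
p_high = lognormal_quantile(0.99865, theta, zeta, sigma)

cpl = (p_med - lsl) / (p_med - p_low)
cpu = (usl - p_med) / (p_high - p_med)
cp = (usl - lsl) / (p_high - p_low)
cpk = min(cpl, cpu)
k = 2 * abs((usl + lsl) / 2 - p_med) / (usl - lsl)
cpm = min((target - lsl) / (p_med - p_low),
          (usl - target) / (p_high - p_med)) / sqrt(
              1 + ((mean - target) / std_dev) ** 2)

print(round(cp, 4), round(cpl, 4), round(cpu, 4), round(cpk, 4), round(cpm, 4))
```

The results agree with the printed Cp, CPL, CPU, Cpk, and Cpm values to about three decimal places.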

Histogram Intervals

This section is included in the summary only if you specify the MIDPERCENTS option in parentheses after the distribution option, as in the statements that produce Figure 4.9. This table lists the interval midpoints along with the observed and estimated percentages of the observations that lie in the interval. The estimated percentages are based on the fitted distribution.

You can also specify the MIDPERCENTS option on its own, outside the parentheses that follow a distribution option, to request a table of interval midpoints with only the observed percent of observations that lie in each interval. See the entry for the MIDPERCENTS option.
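
The estimated percentages in this table are areas under the fitted curve over each interval. As a hedged illustration, the Python sketch below recomputes them for Figure 4.9. It assumes the usual three-parameter lognormal form and assumes equal-width intervals of width 0.02 centered at the printed midpoints; the interval width is inferred from the spacing of the midpoints and is not stated in the output.

```python
from math import log
from statistics import NormalDist


def lognormal_cdf(x, theta, zeta, sigma):
    """F(x) for a lognormal with threshold theta, scale zeta, shape sigma."""
    if x <= theta:
        return 0.0
    return NormalDist().cdf((log(x - theta) - zeta) / sigma)


# Parameter estimates and midpoints copied from Figure 4.9.
theta, zeta, sigma = 0.0, 2.638966, 0.001497
midpoints = [13.95, 13.97, 13.99, 14.01, 14.03, 14.05]
printed = [2.963, 15.354, 33.872, 32.055, 13.050, 2.281]

# ASSUMPTION: equal-width intervals centered at the midpoints.
width = 0.02

# Estimated percent = 100 * (F(upper endpoint) - F(lower endpoint)).
estimated = [
    100 * (lognormal_cdf(mid + width / 2, theta, zeta, sigma)
           - lognormal_cdf(mid - width / 2, theta, zeta, sigma))
    for mid in midpoints
]

for mid, est, ref in zip(midpoints, estimated, printed):
    print(mid, round(est, 3), ref)
```

Under these assumptions the recomputed values match the printed Estimated column to within a few hundredths of a percentage point.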

Quantiles

This table lists observed and estimated quantiles. You can use the PERCENTS= option to specify the list of quantiles to appear in this table. The list in Figure 4.9 is the default list. See the entry for the PERCENTS= option.
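
The estimated quantiles are percentiles of the fitted distribution. The Python sketch below recomputes the Estimated column of Figure 4.9 from the printed parameter estimates, assuming the usual three-parameter lognormal form. The helper name is hypothetical. The observed quantiles depend on the data, which are not reproduced in this section, so they are not recomputed here.

```python
from math import exp
from statistics import NormalDist


def lognormal_quantile(alpha, theta, zeta, sigma):
    """The 100*alpha-th percentile of the fitted lognormal."""
    return theta + exp(zeta + sigma * NormalDist().inv_cdf(alpha))


# Parameter estimates, default percents, and printed values from Figure 4.9.
theta, zeta, sigma = 0.0, 2.638966, 0.001497
percents = [1, 5, 10, 25, 50, 75, 90, 95, 99]
printed = [13.9501, 13.9643, 13.9719, 13.9846, 13.9987,
           14.0129, 14.0256, 14.0332, 14.0475]

estimated = [lognormal_quantile(pct / 100, theta, zeta, sigma)
             for pct in percents]

for pct, est, ref in zip(percents, estimated, printed):
    print(pct, round(est, 4), ref)
```

The recomputed quantiles agree with the printed ones to within about 0.0001; the small residual differences come from rounding in the printed parameter estimates.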


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.