The UNIVARIATE Procedure

Statistical Computations

PROC UNIVARIATE uses standard algorithms to compute the moment statistics (such as the mean, variance, skewness, and kurtosis). See SAS Elementary Statistics Procedures for the statistical formulas. The computational details for confidence limits, hypothesis test statistics, and quantile statistics follow.


Confidence Limits for Parameters of the Normal Distribution
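The two-sided $100(1-\alpha)$ percent confidence interval for the mean has upper and lower limits

$$\bar{x} \pm t_{1-\alpha/2;\,n-1}\,\frac{s}{\sqrt{n}}$$

where $s$ is $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$ and $t_{1-\alpha/2;\,n-1}$ is the $100(1-\alpha/2)$ percentile of the t distribution with $n-1$ degrees of freedom.

The one-sided $100(1-\alpha)$ percent confidence limit is computed as

$$\bar{x} + t_{1-\alpha;\,n-1}\,\frac{s}{\sqrt{n}} \ \text{(upper)} \qquad \bar{x} - t_{1-\alpha;\,n-1}\,\frac{s}{\sqrt{n}} \ \text{(lower)}$$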

The two-sided $100(1-\alpha)$ percent confidence interval for the standard deviation has lower and upper limits

$$s\sqrt{\frac{n-1}{\chi^2_{1-\alpha/2;\,n-1}}} \qquad \text{and} \qquad s\sqrt{\frac{n-1}{\chi^2_{\alpha/2;\,n-1}}}$$

where $\chi^2_{1-\alpha/2;\,n-1}$ and $\chi^2_{\alpha/2;\,n-1}$ are the $100(1-\alpha/2)$ and $100(\alpha/2)$ percentiles of the chi-square distribution with $n-1$ degrees of freedom. A one-sided $100(1-\alpha)$ percent confidence limit is computed by replacing $\alpha/2$ with $\alpha$.

A $100(1-\alpha)$ percent confidence interval for the variance has upper and lower limits equal to the squares of the corresponding upper and lower limits for the standard deviation.

When you use the WEIGHT statement and specify VARDEF=DF in the PROC statement, the $100(1-\alpha)$ percent confidence interval for the weighted mean is

$$\bar{x}_w \pm t_{1-\alpha/2}\,\frac{s_w}{\sqrt{\sum_{i=1}^{n} w_i}}$$

where $\bar{x}_w$ is the weighted mean, $s_w$ is the weighted standard deviation, $w_i$ is the weight for the $i$th observation, and $t_{1-\alpha/2}$ is the $100(1-\alpha/2)$ critical percentage for the t distribution with $n-1$ degrees of freedom.


Tests for Location
PROC UNIVARIATE computes tests for location that include Student's t test, the sign test, and the Wilcoxon signed rank test. All three tests produce a test statistic for the null hypothesis that the mean or median is equal to a given value $\mu_0$ against the two-sided alternative that the mean or median is not equal to $\mu_0$. By default, PROC UNIVARIATE sets the value of $\mu_0$ to zero. Use the MU0= option in the PROC UNIVARIATE statement to test that the mean or median is equal to another value.

The Student's t test is appropriate when the data are from an approximately normal population; otherwise, use nonparametric tests such as the sign test or the signed rank test. For large sample situations, the t test is asymptotically equivalent to a z test.

If you use the WEIGHT statement, PROC UNIVARIATE computes only one weighted test for location, the t test. You must use the default value for the VARDEF= option in the PROC statement.

You can also compare means or medians of paired data. Data are said to be paired when subjects or units are matched in pairs according to one or more variables, such as pairs of subjects with the same age and gender. Paired data also occur when each subject or unit is measured at two times or under two conditions. To compare the means or medians of the two times, create an analysis variable that is the difference between the two measures. The test that the mean or the median difference of the variables equals zero is equivalent to the test that the means or medians of the two original variables are equal. See Performing a Sign Test Using Paired Data.

Student's t Test

PROC UNIVARIATE calculates the t statistic as

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

where $\bar{x}$ is the sample mean, $n$ is the number of nonmissing values for a variable, and $s$ is the sample standard deviation. Under the null hypothesis, the population mean equals $\mu_0$. When the data values are approximately normally distributed, the probability under the null hypothesis of a t statistic that is as extreme as, or more extreme than, the observed value (the p-value) is obtained from the t distribution with $n-1$ degrees of freedom. For large $n$, the t statistic is asymptotically equivalent to a z test.
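As a rough numerical cross-check of the t statistic described above, the following Python sketch assumes the usual form $t = (\bar{x}-\mu_0)/(s/\sqrt{n})$, with $s$ computed using the $n-1$ divisor. The function name `t_statistic` and the sample values are illustrative only. This is not PROC UNIVARIATE's implementation, and it stops short of the p-value, which would require the t distribution with $n-1$ degrees of freedom.

```python
import math
import statistics


def t_statistic(values, mu0=0.0):
    """Return (t, degrees of freedom) for H0: population mean == mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation, divisor n - 1
    return (mean - mu0) / (s / math.sqrt(n)), n - 1


# Hypothetical data: mean 5, tested against mu0 = 3.
t, df = t_statistic([2.0, 4.0, 6.0, 8.0], mu0=3.0)
```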

When you use the WEIGHT statement and the default value of VARDEF=, which is DF, the t statistic is calculated as

$$t_w = \frac{\bar{x}_w - \mu_0}{s_w \big/ \sqrt{\sum_{i=1}^{n} w_i}}$$

where $\bar{x}_w$ is the weighted mean, $s_w$ is the weighted standard deviation, and $w_i$ is the weight for the $i$th observation. The $t_w$ statistic is treated as having a Student's t distribution with $n-1$ degrees of freedom. If you specify the EXCLNPWGT option in the PROC statement, $n$ is the number of nonmissing observations when the value of the WEIGHT variable is positive. By default, $n$ is the number of nonmissing observations for the WEIGHT variable.

Sign Test

PROC UNIVARIATE calculates the sign test statistic as

$$M = \frac{n^+ - n^-}{2}$$

where $n^+$ is the number of values that are greater than $\mu_0$ and $n^-$ is the number of values that are less than $\mu_0$. Values equal to $\mu_0$ are discarded.

Under the null hypothesis that the population median is equal to $\mu_0$, the p-value for the observed statistic M is

$$\Pr\left(|M_{obs}| \ge |M|\right) = 0.5^{(n_t-1)} \sum_{j=0}^{\min(n^+,\,n^-)} \binom{n_t}{j}$$

where $n_t = n^+ + n^-$ is the number of $x_i$ values not equal to $\mu_0$.
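The sign test can be sketched in a few lines of Python. The sketch assumes the statistic is $M = (n^+ - n^-)/2$ and that the two-sided p-value is $0.5^{(n_t-1)}\sum_{j=0}^{\min(n^+,n^-)}\binom{n_t}{j}$, where $n_t$ counts the values not equal to $\mu_0$. Capping the result at 1 (relevant when $n^+ = n^-$) is an assumption of this sketch, and the function name and data are hypothetical.

```python
from math import comb


def sign_test(values, mu0=0.0):
    """Return (M, two-sided p-value) for H0: population median == mu0."""
    n_plus = sum(1 for x in values if x > mu0)
    n_minus = sum(1 for x in values if x < mu0)
    n_t = n_plus + n_minus  # values equal to mu0 are discarded
    m = (n_plus - n_minus) / 2
    if n_t == 0:
        return m, 1.0
    tail = sum(comb(n_t, j) for j in range(min(n_plus, n_minus) + 1))
    # Cap at 1: the doubled tail exceeds 1 when n_plus == n_minus (assumption).
    return m, min(1.0, tail * 0.5 ** (n_t - 1))


# Six values above zero, one below, one tie: M = 2.5 and p = 8/64.
m_stat, p_value = sign_test([1.2, 0.8, -0.3, 2.1, 0.0, 1.7, 0.9, 1.1])
```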

Wilcoxon Signed Rank Test

PROC UNIVARIATE calculates the Wilcoxon signed rank test statistic as

$$S = \sum r_i^+ - \frac{n_t(n_t+1)}{4}$$

where $r_i^+$ is the rank of $|x_i - \mu_0|$ after discarding values of $x_i$ equal to $\mu_0$, $n_t$ is the number of $x_i$ values not equal to $\mu_0$, and the sum is calculated for values of $x_i - \mu_0$ greater than 0. Average ranks are used for tied values.
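To make the ranking rule concrete, here is a minimal Python sketch of the signed rank statistic. It assumes the centered form $S = \sum r_i^+ - n_t(n_t+1)/4$, where the sum runs over positive differences and tied absolute differences receive average ranks. The function name `signed_rank_statistic` is hypothetical, and the sketch computes only the statistic, not its p-value.

```python
def signed_rank_statistic(values, mu0=0.0):
    """Return (S, n_t) using average ranks of |x - mu0| for tied values."""
    diffs = [x - mu0 for x in values if x != mu0]  # discard values equal to mu0
    n_t = len(diffs)
    order = sorted(range(n_t), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n_t
    i = 0
    while i < n_t:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n_t and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    positive_sum = sum(r for r, d in zip(ranks, diffs) if d > 0)
    return positive_sum - n_t * (n_t + 1) / 4, n_t


# The two values with absolute difference 2 share the average rank 2.5.
s_stat, n_t = signed_rank_statistic([3.0, -1.0, 2.0, -2.0, 5.0, 0.0])
```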

The p-value is the probability of obtaining a signed rank statistic greater in absolute value than the absolute value of the observed statistic S. If $n_t \le 20$, the significance level of $S$ is computed from the exact distribution of $S$, which can be enumerated under the null hypothesis that the distribution is symmetric about $\mu_0$. When $n_t > 20$, the significance level of $S$ is computed by treating

$$S\sqrt{\frac{n_t - 1}{n_t V - S^2}}$$

as a Student's t variate with $n_t - 1$ degrees of freedom. $V$ is computed as

$$V = \frac{1}{24}\,n_t(n_t+1)(2n_t+1) - \frac{1}{48}\sum t_i(t_i+1)(t_i-1)$$

where the sum is calculated over groups that are tied in absolute value, and $t_i$ is the number of tied values in the $i$th group (Iman 1974; Conover 1980).

The Wilcoxon signed rank test assumes that the distribution is symmetric. If the assumption is not valid, you can use the sign test to test that the median is $\mu_0$. See Lehmann (1975) for more details.


Goodness-of-Fit Tests
When you specify the NORMAL option in the PROC UNIVARIATE statement or you request a fitted parametric distribution in the HISTOGRAM statement, the procedure computes test statistics for the null hypothesis that the values of the analysis variable are a random sample from the specified theoretical distribution. When you specify the normal distribution, the test statistics depend on the sample size. If the sample size is less than or equal to 2000, PROC UNIVARIATE calculates the Shapiro-Wilk W statistic. For a specified distribution, the procedure attempts to calculate three goodness-of-fit tests that are based on the empirical distribution function (EDF): the Kolmogorov-Smirnov D statistic, the Anderson-Darling statistic, and the Cramer-von Mises statistic. However, some of the EDF tests are currently not supported when the parameters of a specified distribution are estimated. See Availability of EDF Tests for more information.

You determine whether to reject the null hypothesis by examining the probability that is associated with a test statistic. When the p-value is less than the predetermined critical value (alpha value), you reject the null hypothesis and conclude that the data did not come from the theoretical distribution.

If you want to test the normality assumptions that underlie analysis of variance methods, beware of using a statistical test for normality alone. A test's ability to reject the null hypothesis (known as the power of the test) increases with the sample size. As the sample size becomes larger, increasingly smaller departures from normality can be detected. Since small deviations from normality do not severely affect the validity of analysis of variance tests, it is important to examine other statistics and plots to make a final assessment of normality. The skewness and kurtosis measures and the plots that are provided by the PLOTS option, the HISTOGRAM statement, PROBPLOT statement, and QQPLOT statement can be very helpful. For small sample sizes, power is low for detecting larger departures from normality that may be important. To increase the test's ability to detect such deviations, you may want to declare significance at higher levels, such as 0.15 or 0.20, rather than the often-used 0.05 level. Again, consulting plots and additional statistics will help you assess the severity of the deviations from normality.

Shapiro-Wilk Statistic

If the sample size is less than or equal to 2000 and you specify the NORMAL option, PROC UNIVARIATE computes the Shapiro-Wilk statistic, W. The W statistic is the ratio of the best estimator of the variance (based on the square of a linear combination of the order statistics) to the usual corrected sum of squares estimator of the variance (Shapiro, 1965). W must be greater than zero and less than or equal to one. Small values of W lead to the rejection of the null hypothesis of normality. The distribution of W is highly skewed. Seemingly large values of W (such as 0.90) may be considered small and lead you to reject the null hypothesis. When the sample size is greater than three, the coefficients to compute the linear combination of the order statistics are approximated by the method of Royston (1992).

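To test the null hypothesis, the statistic $W_n$ computed from a sample of size $n$ is transformed to

$$Z_n = \frac{-\log\left(\gamma - \log(1 - W_n)\right) - \mu}{\sigma}$$

when $4 \le n \le 11$ and

$$Z_n = \frac{\log(1 - W_n) - \mu}{\sigma}$$

when $12 \le n \le 2000$, where $\gamma$, $\mu$, and $\sigma$ are functions of $n$, obtained from simulation results, and $Z_n$ is a standard normal variate. Large values of $Z_n$ indicate departure from normality.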

EDF Goodness-of-Fit Tests

When you fit a parametric distribution, PROC UNIVARIATE provides a series of goodness-of-fit tests that are based on the empirical distribution function (EDF). The empirical distribution function is defined for a set of $n$ independent observations $X_1, \ldots, X_n$ with a common distribution function $F(x)$. The observations are ordered from smallest to largest as $X_{(1)}, \ldots, X_{(n)}$. The empirical distribution function, $F_n(x)$, is defined as

$$F_n(x) = \begin{cases} 0, & x < X_{(1)} \\ i/n, & X_{(i)} \le x < X_{(i+1)}, \quad i = 1, \ldots, n-1 \\ 1, & X_{(n)} \le x \end{cases}$$

Note that $F_n(x)$ is a step function that takes a step of height $1/n$ at each observation. This function estimates the distribution function $F(x)$. At any value $x$, $F_n(x)$ is the proportion of observations that are less than or equal to $x$, while $F(x)$ is the theoretical probability of an observation that is less than or equal to $x$. EDF statistics measure the discrepancy between $F_n(x)$ and $F(x)$.

The computational formulas for the EDF statistics use the probability integral transformation $U = F(X)$. If $F(X)$ is the distribution function of $X$, the random variable $U$ is uniformly distributed between 0 and 1.

Given $n$ observations $X_{(1)}, \ldots, X_{(n)}$, PROC UNIVARIATE computes the values $U_{(i)} = F(X_{(i)})$ by applying the transformation.

When you specify the NORMAL option in the PROC UNIVARIATE statement or use the HISTOGRAM statement to fit a parametric distribution, PROC UNIVARIATE provides a series of goodness-of-fit tests that are based on the empirical distribution function (EDF): the Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises tests.

These tests are based on various measures of the discrepancy between the empirical distribution function $F_n(x)$ and the proposed cumulative distribution function $F(x)$.

Once the EDF test statistics are computed, the associated p-values are calculated. PROC UNIVARIATE uses internal tables of probability levels that are similar to those given by D'Agostino and Stephens (1986). If the value lies between two probability levels, then linear interpolation is used to estimate the probability value.

Note:   PROC UNIVARIATE does not support some of the EDF tests when you use the HISTOGRAM statement and you estimate the parameters of the specified distribution. See Availability of EDF Tests for more information.

Kolmogorov D Statistic

The Kolmogorov-Smirnov statistic (D) is defined as

$$D = \sup_x \left| F_n(x) - F(x) \right|$$

The Kolmogorov-Smirnov statistic belongs to the supremum class of EDF statistics. This class of statistics is based on the largest vertical difference between $F(x)$ and $F_n(x)$.

The Kolmogorov-Smirnov statistic is computed as the maximum of $D^+$ and $D^-$. $D^+$ is the largest vertical distance between the EDF and the distribution function when the EDF is greater than the distribution function. $D^-$ is the largest vertical distance when the EDF is less than the distribution function.

$$D^+ = \max_i\left(\frac{i}{n} - U_{(i)}\right), \qquad D^- = \max_i\left(U_{(i)} - \frac{i-1}{n}\right), \qquad D = \max\left(D^+, D^-\right)$$

PROC UNIVARIATE uses a modified Kolmogorov D statistic to test the data against a normal distribution with mean and variance equal to the sample mean and variance.
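The computation of $D$ from $D^+$ and $D^-$ can be sketched as follows. The sketch assumes the standard computing formulas $D^+ = \max_i(i/n - U_{(i)})$ and $D^- = \max_i(U_{(i)} - (i-1)/n)$, where $U_{(i)}$ are the ordered values of the hypothesized distribution function at the data. The function name `kolmogorov_d` is hypothetical, and the sketch computes the unmodified statistic against a fully specified distribution, not the modified statistic mentioned above.

```python
def kolmogorov_d(values, cdf):
    """Return D = max(D+, D-) for data against a fully specified CDF."""
    u = sorted(cdf(x) for x in values)  # probability integral transform
    n = len(u)
    d_plus = max((i + 1) / n - u[i] for i in range(n))  # EDF above the CDF
    d_minus = max(u[i] - i / n for i in range(n))       # EDF below the CDF
    return max(d_plus, d_minus)


# Hypothetical data tested against the uniform(0, 1) distribution, F(x) = x.
d = kolmogorov_d([0.1, 0.4, 0.7], lambda x: x)
```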

Anderson-Darling Statistic

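The Anderson-Darling statistic and the Cramer-von Mises statistic belong to the quadratic class of EDF statistics. This class of statistics is based on the squared difference $(F_n(x) - F(x))^2$. Quadratic statistics have the following general form:

$$Q = n\int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 \psi(x)\, dF(x)$$

The function $\psi(x)$ weights the squared difference $(F_n(x) - F(x))^2$.

The Anderson-Darling statistic ($A^2$) is defined as

$$A^2 = n\int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 \left[F(x)\left(1 - F(x)\right)\right]^{-1} dF(x)$$

where the weight function is $\psi(x) = \left[F(x)\left(1 - F(x)\right)\right]^{-1}$.

The Anderson-Darling statistic is computed as

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}\left[(2i-1)\log U_{(i)} + (2n+1-2i)\log\left(1 - U_{(i)}\right)\right]$$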
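A short Python sketch of the Anderson-Darling computation follows. It assumes the standard computing formula $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}[(2i-1)\log U_{(i)} + (2n+1-2i)\log(1-U_{(i)})]$, applied to already transformed and sorted values $U_{(i)}$ strictly between 0 and 1. The function name is hypothetical and no p-value is computed.

```python
import math


def anderson_darling(u_sorted):
    """Return A^2 from sorted probability-integral-transformed values."""
    n = len(u_sorted)
    total = sum(
        (2 * i - 1) * math.log(u) + (2 * n + 1 - 2 * i) * math.log(1 - u)
        for i, u in enumerate(u_sorted, start=1)
    )
    return -n - total / n


# Hypothetical transformed values.
a_squared = anderson_darling([0.25, 0.75])
```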


Cramer-von Mises Statistic

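The Cramer-von Mises statistic ($W^2$) is defined as

$$W^2 = n\int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 dF(x)$$

where the weight function is $\psi(x) = 1$.

The Cramer-von Mises statistic is computed as

$$W^2 = \sum_{i=1}^{n}\left(U_{(i)} - \frac{2i-1}{2n}\right)^2 + \frac{1}{12n}$$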
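The Cramer-von Mises computation can be sketched the same way. This assumes the standard computing formula $W^2 = \sum_{i=1}^{n}(U_{(i)} - \frac{2i-1}{2n})^2 + \frac{1}{12n}$ on sorted transformed values. The function name is hypothetical.

```python
def cramer_von_mises(u_sorted):
    """Return W^2 from sorted probability-integral-transformed values."""
    n = len(u_sorted)
    squared = sum(
        (u - (2 * i - 1) / (2 * n)) ** 2
        for i, u in enumerate(u_sorted, start=1)
    )
    return squared + 1 / (12 * n)


# When each U_(i) sits exactly at (2i - 1) / (2n), only the 1/(12n) term remains.
w_squared = cramer_von_mises([0.25, 0.75])
```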


Probability Values of EDF Tests

Once the EDF test statistics are computed, PROC UNIVARIATE computes the associated probability values.

The probability value depends upon the parameters that are known and the parameters that PROC UNIVARIATE estimates for the fitted distribution. Availability of EDF Tests summarizes different combinations of estimated parameters for which EDF tests are available.

Note:   PROC UNIVARIATE assumes that the threshold (THETA=) parameter for the beta, exponential, gamma, lognormal, and Weibull distributions is known. If you omit its value, PROC UNIVARIATE assumes that it is zero and that it is known. Likewise, PROC UNIVARIATE assumes that the SIGMA= parameter, which determines the upper threshold (SIGMA) for the beta distribution, is known. If you omit its value, PROC UNIVARIATE assumes that the value is one. These parameters are not listed in Availability of EDF Tests because they are assumed to be known in all cases, and they do not affect which EDF statistics PROC UNIVARIATE computes.

Availability of EDF Tests

| Distribution | Parameters | EDF |
|---|---|---|
| Beta | $\alpha$ and $\beta$ unknown | none |
| | $\alpha$ known, $\beta$ unknown | none |
| | $\alpha$ unknown, $\beta$ known | none |
| | $\alpha$ and $\beta$ known | all |
| Exponential | $\sigma$ unknown | all |
| | $\sigma$ known | all |
| Gamma | $\alpha$ and $\sigma$ unknown | none |
| | $\alpha$ known, $\sigma$ unknown | none |
| | $\alpha$ unknown, $\sigma$ known | none |
| | $\alpha$ and $\sigma$ known | all |
| Lognormal | $\zeta$ and $\sigma$ unknown | all |
| | $\zeta$ known, $\sigma$ unknown | $A^2$ and $W^2$ |
| | $\zeta$ unknown, $\sigma$ known | $A^2$ and $W^2$ |
| | $\zeta$ and $\sigma$ known | all |
| Normal | $\mu$ and $\sigma$ unknown | all |
| | $\mu$ known, $\sigma$ unknown | $A^2$ and $W^2$ |
| | $\mu$ unknown, $\sigma$ known | $A^2$ and $W^2$ |
| | $\mu$ and $\sigma$ known | all |
| Weibull | $c$ and $\sigma$ unknown | $A^2$ and $W^2$ |
| | $c$ known, $\sigma$ unknown | $A^2$ and $W^2$ |
| | $c$ unknown, $\sigma$ known | $A^2$ and $W^2$ |
| | $c$ and $\sigma$ known | all |



Robust Estimators
A statistical method is robust if the method is insensitive to slight departures from the assumptions that justify the method. PROC UNIVARIATE provides several methods for robust estimation of location and scale.

Winsorized Means

When outliers are present in the data, the Winsorized mean is a robust estimator of the location that is relatively insensitive to the outlying values. The k-times Winsorized mean is calculated as

$$\bar{x}_{wk} = \frac{1}{n}\left[(k+1)\,x_{(k+1)} + \sum_{i=k+2}^{n-k-1} x_{(i)} + (k+1)\,x_{(n-k)}\right]$$

The Winsorized mean is computed after the $k$ smallest observations are replaced by the ($k+1$)st smallest observation, and the $k$ largest observations are replaced by the ($k+1$)st largest observation.
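The replacement rule just described translates directly into code. The following Python sketch is illustrative only; the function name `winsorized_mean` and the data are hypothetical, and it covers the point estimate, not the associated t test or confidence limits.

```python
def winsorized_mean(values, k):
    """Mean after replacing the k lowest and k highest values by their neighbors."""
    x = sorted(values)
    n = len(x)
    if not 0 <= k < n / 2:
        raise ValueError("k must satisfy 0 <= k < n/2")
    # k copies of the (k+1)st smallest, the untouched middle, k copies of the (k+1)st largest.
    winsorized = [x[k]] * k + x[k:n - k] + [x[n - k - 1]] * k
    return sum(winsorized) / n


# The outlier 100 is replaced by 4 and the value 1 by 2: (2 + 2 + 3 + 4 + 4) / 5.
wm = winsorized_mean([1, 2, 3, 4, 100], k=1)
```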

For a symmetric distribution, the symmetrically Winsorized mean is an unbiased estimate of the population mean. But the Winsorized mean does not have a normal distribution even if the data are from a normal population.

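The Winsorized sum of squared deviations is defined as

$$s^2_{wk} = (k+1)\left(x_{(k+1)} - \bar{x}_{wk}\right)^2 + \sum_{i=k+2}^{n-k-1}\left(x_{(i)} - \bar{x}_{wk}\right)^2 + (k+1)\left(x_{(n-k)} - \bar{x}_{wk}\right)^2$$

A Winsorized t test is given by

$$t_{wk} = \frac{\bar{x}_{wk} - \mu_0}{\mathrm{STDERR}(\bar{x}_{wk})}$$

where the standard error of the Winsorized mean is

$$\mathrm{STDERR}(\bar{x}_{wk}) = \frac{n-1}{n-2k-1} \times \frac{s_{wk}}{\sqrt{n(n-1)}}$$

When the data are from a symmetric distribution, the distribution of the Winsorized t statistic $t_{wk}$ is approximated by a Student's t distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin 1963, Dixon and Tukey 1968).

A $100(1-\alpha)$ percent confidence interval for the Winsorized mean has upper and lower limits

$$\bar{x}_{wk} \pm t_{1-\alpha/2}\,\mathrm{STDERR}(\bar{x}_{wk})$$

where $t_{1-\alpha/2}$ is the $100(1-\alpha/2)$ critical value of the Student's t distribution with $n-2k-1$ degrees of freedom.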

Trimmed Means

When outliers are present in the data, the trimmed mean is a robust estimator of the location that is relatively insensitive to the outlying values. The $k$-times trimmed mean is calculated as

$$\bar{x}_{tk} = \frac{1}{n-2k}\sum_{i=k+1}^{n-k} x_{(i)}$$

The trimmed mean is computed after the $k$ smallest and $k$ largest observations are deleted from the sample. In other words, the observations are trimmed at each end.
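Trimming can be sketched in the same style. The function name `trimmed_mean` and the data are hypothetical; the sketch computes only the point estimate, dropping the $k$ smallest and $k$ largest values before averaging.

```python
def trimmed_mean(values, k):
    """Mean of the sample after deleting the k smallest and k largest values."""
    x = sorted(values)
    n = len(x)
    if not 0 <= k < n / 2:
        raise ValueError("k must satisfy 0 <= k < n/2")
    kept = x[k:n - k]
    return sum(kept) / len(kept)


# Dropping 1 and 100 leaves (2 + 3 + 4) / 3.
tm = trimmed_mean([1, 2, 3, 4, 100], k=1)
```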

For a symmetric distribution, the symmetrically trimmed mean is an unbiased estimate of the population mean. But the trimmed mean does not have a normal distribution even if the data are from a normal population.

A robust estimate of the variance of the trimmed mean $\bar{x}_{tk}$ can be based on the Winsorized sum of squared deviations (Tukey and McLaughlin 1963). The resulting trimmed t test is given by

$$t_{tk} = \frac{\bar{x}_{tk} - \mu_0}{\mathrm{STDERR}(\bar{x}_{tk})}$$

where the standard error of the trimmed mean is

$$\mathrm{STDERR}(\bar{x}_{tk}) = \frac{s_{wk}}{\sqrt{(n-2k)(n-2k-1)}}$$

and $s_{wk}$ is the square root of the Winsorized sum of squared deviations.

When the data are from a symmetric distribution, the distribution of the trimmed t statistic $t_{tk}$ is approximated by a Student's t distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin 1963, Dixon and Tukey 1968).

A $100(1-\alpha)$ percent confidence interval for the trimmed mean has upper and lower limits

$$\bar{x}_{tk} \pm t_{1-\alpha/2}\,\mathrm{STDERR}(\bar{x}_{tk})$$

where $t_{1-\alpha/2}$ is the $100(1-\alpha/2)$ critical value of the Student's t distribution with $n-2k-1$ degrees of freedom.

Robust Measures of Scale

The sample standard deviation is a commonly used estimator of the population scale. However, it is sensitive to outliers and may not remain bounded when a single data point is replaced by an arbitrary number. With robust scale estimators, the estimates remain bounded even when a portion of the data points are replaced by arbitrary numbers.

PROC UNIVARIATE computes robust measures of scale that include the interquartile range, Gini's mean difference G, MAD, $Q_n$, and $S_n$, with their corresponding estimates of $\sigma$.

The interquartile range is a simple robust scale estimator, which is the difference between the upper and lower quartiles. For a normal population, the standard deviation $\sigma$ can be estimated by dividing the interquartile range by 1.34898.
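The divisor 1.34898 is the interquartile range of a standard normal distribution, which the following Python sketch reproduces from the normal quantile function. The helper `sigma_from_iqr` is hypothetical and takes the quartiles as inputs, because Python's own quartile conventions need not match the PCTLDEF= definitions.

```python
from statistics import NormalDist

# Distance between the 75th and 25th percentiles of a standard normal.
IQR_STANDARD_NORMAL = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)


def sigma_from_iqr(q1, q3):
    """Estimate sigma for normal data from the lower and upper quartiles."""
    return (q3 - q1) / IQR_STANDARD_NORMAL
```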

Gini's mean difference is also a robust estimator of the standard deviation $\sigma$. For a normal population, Gini's mean difference has expected value $2\sigma/\sqrt{\pi}$. Thus, multiplying Gini's mean difference by $\sqrt{\pi}/2$ yields a robust estimator of the standard deviation when the data are from a normal sample. The constructed estimator has high efficiency for the normal distribution relative to the usual sample standard deviation. It is also less sensitive to the presence of outliers than the sample standard deviation.

Gini's mean difference is computed as

$$G = \frac{1}{\binom{n}{2}}\sum_{i<j}\left|x_i - x_j\right|$$

If the observations are from a normal distribution, then $\sqrt{\pi}\,G/2$ is an unbiased estimator of the standard deviation $\sigma$.
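As a small worked example, the following Python sketch computes Gini's mean difference as the average absolute difference over all $\binom{n}{2}$ pairs of observations, then rescales it by $\sqrt{\pi}/2$ to estimate $\sigma$ under normality. The function name and data are hypothetical, and the quadratic-time loop is chosen for clarity rather than speed.

```python
import math
from itertools import combinations


def gini_mean_difference(values):
    """Average absolute difference over all unordered pairs of observations."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Pairwise differences are 1, 3, and 2, so G = 2.
g = gini_mean_difference([1.0, 2.0, 4.0])
sigma_hat = math.sqrt(math.pi) / 2 * g  # estimate of sigma, assuming normal data
```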

A very robust scale estimator is the MAD, the median absolute deviation about the median (Hampel 1974).

$$\mathrm{MAD} = \mathrm{med}_i\left(\left|x_i - \mathrm{med}_j(x_j)\right|\right)$$

where the inner median, $\mathrm{med}_j(x_j)$, is the median of the $n$ observations and the outer median, $\mathrm{med}_i$, is the median of the $n$ absolute values of the deviations about the median.

For a normal distribution, 1.4826·MAD can be used to estimate the standard deviation $\sigma$.
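The MAD and its normal-consistency factor can be sketched as follows. The function name `mad` is hypothetical. The factor 1.4826 is the reciprocal of the 75th percentile of the standard normal distribution, which the sketch recomputes as a check.

```python
from statistics import NormalDist, median


def mad(values):
    """Median absolute deviation about the median."""
    center = median(values)
    return median(abs(x - center) for x in values)


# Reciprocal of the standard normal 75th percentile, about 1.4826.
MAD_TO_SIGMA = 1 / NormalDist().inv_cdf(0.75)

# Deviations about the median 3 are 2, 1, 0, 1, 97, whose median is 1.
sigma_hat = MAD_TO_SIGMA * mad([1, 2, 3, 4, 100])
```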

The MAD statistic has low efficiency for normal distributions, and it may not be appropriate for asymmetric distributions. Rousseeuw and Croux (1993) proposed two new statistics as alternatives to the MAD statistic.

The first statistic is

$$S_n = 1.1926\;\mathrm{med}_i\left(\mathrm{med}_j\left(\left|x_i - x_j\right|\right)\right)$$

where the outer median, $\mathrm{med}_i$, is the median of the $n$ medians of $\{|x_i - x_j|;\ j = 1, 2, \ldots, n\}$.

To reduce the small-sample bias, $c_{sn}S_n$ is used to estimate the standard deviation $\sigma$, where $c_{sn}$ is a correction factor (Croux and Rousseeuw 1992).

The second statistic is

$$Q_n = 2.2219\,\{|x_i - x_j|;\ i < j\}_{(k)}$$

where $k = \binom{h}{2}$ and $h = [n/2] + 1$, with $[n/2]$ the integer part of $n/2$. That is, $Q_n$ is 2.2219 times the $k$th order statistic of the $\binom{n}{2}$ distances between data points.

The bias-corrected statistic, $c_{qn}Q_n$, is used to estimate the standard deviation $\sigma$, where $c_{qn}$ is a correction factor.


Calculating Percentiles
The UNIVARIATE procedure automatically computes the minimum, 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, 99th, and maximum percentiles. You use the PCTLDEF= option in the PROC UNIVARIATE statement to specify one of five methods to compute quantile statistics. See Percentile and Related Statistics for more information.

To compute the quantile that each observation falls in, use PROC RANK with the GROUP= option. To calculate percentiles other than the default percentiles, use PCTLPTS= and PCTLPRE= in the OUTPUT statement.

Confidence Limits for Quantiles

The CIPCTLDF option and CIPCTLNORMAL option compute confidence limits for quantiles using methods described in Hahn and Meeker (1991).

When [IMAGE], the two-sided [IMAGE] percent confidence interval for quantiles that are based on normal data has lower and upper limits

[IMAGE]

where [IMAGE] is the percentile [IMAGE].

When [IMAGE], the lower and upper limits are

[IMAGE]

A one-sided [IMAGE] percent confidence limit is computed by replacing [IMAGE] with [IMAGE]. The factor [IMAGE] is described in Owen and Hua (1977) and Odeh and Owen (1980).

The two-sided distribution-free $100(1-\alpha)$% confidence interval for quantiles from a sample of size $n$ is

$$\left(X_{(l)},\ X_{(u)}\right)$$

where $X_{(j)}$ is the $j$th order statistic. The lower rank $l$ and upper rank $u$ are integers that are symmetric or nearly symmetric around $[np]+1$, where $[np]$ is the integral part of $np$.

The $l$ and $u$ are chosen so that the order statistics $X_{(l)}$ and $X_{(u)}$ (1) are approximately symmetric about $X_{([np]+1)}$, (2) are as close to $X_{([np]+1)}$ as possible, and (3) satisfy the coverage probability requirement.

The coverage probability is sometimes less than $1-\alpha$. This can occur in the tails of the distribution when the sample size is small. To avoid this problem, you can specify the option TYPE=ASYMMETRIC, which causes PROC UNIVARIATE to use asymmetric values of $l$ and $u$. However, PROC UNIVARIATE first attempts to compute confidence limits that satisfy all three conditions. If the last condition is not satisfied, then the first condition is relaxed. Thus, some of the confidence limits may be symmetric while others, especially in the extremes, are not.
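To see why coverage can fall short in the tails, the coverage of an order-statistic interval can be evaluated directly. The following Python sketch relies on the standard result for continuous data that the interval between the $l$th and $u$th order statistics contains the $p$th quantile with probability $B(u-1; n, p) - B(l-1; n, p)$, where $B$ is the binomial distribution function. The function names are hypothetical, and this is not a description of how PROC UNIVARIATE selects the ranks.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(B <= k) for B ~ Binomial(n, p); evaluates to 0 when k < 0."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def coverage(l, u, n, p):
    """Probability that (X_(l), X_(u)) contains the p-th quantile (1-based ranks)."""
    return binomial_cdf(u - 1, n, p) - binomial_cdf(l - 1, n, p)


# Median of a sample of size 10, using the 2nd and 9th order statistics.
cov_median = coverage(2, 9, n=10, p=0.5)
# 10th percentile with the widest possible interval: coverage stays well below 0.95.
cov_tail = coverage(1, 10, n=10, p=0.1)
```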

A one-sided distribution-free lower [IMAGE] percent confidence limit is computed as [IMAGE] when [IMAGE] is the largest integer that satisfies the inequality

[IMAGE]

where [IMAGE], and [IMAGE]. Likewise, a one-sided distribution-free upper [IMAGE]% confidence limit is computed as [IMAGE] when [IMAGE] is the smallest integer that satisfies the inequality

[IMAGE]

where [IMAGE], and [IMAGE].

Weighted Quantiles

When you use the WEIGHT statement, the percentiles are computed as follows. Let $x_{(i)}$ be the $i$th ordered nonmissing value, $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Then, for a given value of $p$ between 0 and 1, the $p$th weighted quantile (or $100p$th weighted percentile), $y$, is computed from the empirical distribution function with averaging

$$y = \begin{cases} \frac{1}{2}\left(x_{(i)} + x_{(i+1)}\right) & \text{if } \sum_{j=1}^{i} w_j = pW \\ x_{(i+1)} & \text{if } \sum_{j=1}^{i} w_j < pW < \sum_{j=1}^{i+1} w_j \end{cases}$$

where $w_i$ is the weight associated with $x_{(i)}$ and $W = \sum_{i=1}^{n} w_i$ is the sum of the weights.

When the observations have identical weights, the weighted percentiles are the same as the unweighted percentiles with PCTLDEF=5.
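A minimal Python sketch of a weighted quantile "with averaging" follows. It assumes the rule that $y$ is the average of $x_{(i)}$ and $x_{(i+1)}$ when the cumulative weight through $x_{(i)}$ equals $pW$ exactly, and is $x_{(i+1)}$ when $pW$ falls strictly between two successive cumulative weights. The function name is hypothetical, weights are assumed positive, and the exact floating-point comparison is a simplification that a production implementation would need to handle more carefully.

```python
def weighted_quantile(values, weights, p):
    """Weighted quantile from the empirical distribution function with averaging."""
    pairs = sorted(zip(values, weights))
    target = p * sum(w for _, w in pairs)  # p * W
    cumulative = 0.0
    for idx, (x, w) in enumerate(pairs):
        cumulative += w
        if cumulative == target and idx + 1 < len(pairs):
            return (x + pairs[idx + 1][0]) / 2  # boundary case: average neighbors
        if cumulative >= target:
            return x
    return pairs[-1][0]


# With equal weights, the median of four values averages the middle two.
q50 = weighted_quantile([4, 1, 3, 2], [1, 1, 1, 1], 0.5)
q30 = weighted_quantile([4, 1, 3, 2], [1, 1, 1, 1], 0.3)
```

With equal weights the results agree with the averaging rule of the unweighted definition: `q50` averages the two middle values and `q30` picks the second smallest value.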


Calculating the Mode
The mode is the value that occurs most often in the data. PROC UNIVARIATE counts repetitions of the actual values or, if you specify the ROUND= option, the rounded values. If a tie occurs for the most frequent value, the procedure reports the lowest value. To list all possible modes, use the MODES option in the PROC UNIVARIATE statement. When no repetitions occur in the data (as with truly continuous data), the procedure does not report the mode.

The WEIGHT statement has no effect on the mode.
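The tie-breaking rule for the mode can be sketched in Python as follows. The function name `lowest_mode` is hypothetical; it returns the smallest of the most frequent values and returns `None` when no value repeats, mirroring the behavior described above.

```python
from collections import Counter


def lowest_mode(values):
    """Smallest most-frequent value, or None when no value occurs more than once."""
    counts = Counter(values)
    if not counts:
        return None
    top = max(counts.values())
    if top == 1:
        return None  # no repetitions, so no mode is reported
    return min(v for v, c in counts.items() if c == top)
```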


Formulas for Fitted Continuous Distributions
The following sections provide information about the families of parametric distributions that you can fit with the HISTOGRAM statement. Properties of the parametric curves are discussed by Johnson, et al. (1994).

Beta Distribution

The fitted density function is

[IMAGE]

where [IMAGE] and

[IMAGE]

This notation is consistent with that of other distributions that you can fit with the HISTOGRAM statement. However, many texts, including Johnson, et al. (1994), write the beta density function as:

[IMAGE]

The two notations are related as follows:

[IMAGE]

The range of the beta distribution is bounded below by a threshold parameter $\theta$ and above by $\theta + \sigma$. If you specify a fitted beta curve using the BETA option, $\theta$ must be less than the minimum data value, and $\theta + \sigma$ must be greater than the maximum data value. You can specify $\theta$ and $\sigma$ with the THETA= and SIGMA= values in parentheses after the keyword BETA. By default, $\theta = 0$ and $\sigma = 1$. If you specify THETA=EST and SIGMA=EST, maximum likelihood estimates are computed for $\theta$ and $\sigma$.

Note:   However, three- and four-parameter maximum likelihood estimation may not always converge.

In addition, you can specify $\alpha$ and $\beta$ with the ALPHA= and BETA= beta-options, respectively. By default, the procedure calculates maximum likelihood estimates for $\alpha$ and $\beta$. For example, to fit a beta density curve to a set of data bounded below by 32 and above by 212 with maximum likelihood estimates for $\alpha$ and $\beta$, use the following statement:

histogram length / beta(theta=32 sigma=180);

The beta distributions are also referred to as Pearson Type I or II distributions. These include the power-function distribution ($\beta = 1$), the arc-sine distribution ($\alpha = \beta = \frac{1}{2}$), and the generalized arc-sine distributions ($\alpha + \beta = 1$, $\beta \ne \frac{1}{2}$). You can use the DATA step function BETAINV to compute beta quantiles and the DATA step function PROBBETA to compute beta probabilities.

Exponential Distribution

The fitted density function is

[IMAGE]

where

[IMAGE]

The threshold parameter $\theta$ must be less than or equal to the minimum data value. You can specify $\theta$ with the THRESHOLD= exponential-option. By default, $\theta = 0$. If you specify THETA=EST, a maximum likelihood estimate is computed for $\theta$. In addition, you can specify $\sigma$ with the SCALE= exponential-option. By default, the procedure calculates a maximum likelihood estimate for $\sigma$. Note that some authors define the scale parameter as $1/\sigma$.

The exponential distribution is a special case of both the gamma distribution (with $\alpha = 1$) and the Weibull distribution (with $c = 1$). A related distribution is the extreme value distribution. If $Y = \exp(-X)$ has an exponential distribution, then $X$ has an extreme value distribution.

Gamma Distribution

The fitted density function is

[IMAGE]

where

[IMAGE]

The threshold parameter $\theta$ must be less than the minimum data value. You can specify $\theta$ with the THRESHOLD= gamma-option. By default, $\theta = 0$. If you specify THETA=EST, a maximum likelihood estimate is computed for $\theta$. In addition, you can specify $\sigma$ and $\alpha$ with the SCALE= and ALPHA= gamma-options. By default, the procedure calculates maximum likelihood estimates for $\sigma$ and $\alpha$.

The gamma distributions are also referred to as Pearson Type III distributions, and they include the chi-square, exponential, and Erlang distributions. The probability density function for the chi-square distribution is

[IMAGE]

Notice that this is a gamma distribution with [IMAGE], and [IMAGE]. The exponential distribution is a gamma distribution with [IMAGE], and the Erlang distribution is a gamma distribution with [IMAGE] being a positive integer. A related distribution is the Rayleigh distribution. If [IMAGE] where the [IMAGE]'s are independent [IMAGE] variables, then [IMAGE] is distributed with a [IMAGE] distribution having a probability density function of

[IMAGE]

If [IMAGE], the preceding distribution is referred to as the Rayleigh distribution. You can use the DATA step function GAMINV to compute gamma quantiles and the DATA step function PROBGAM to compute gamma probabilities.
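The statement that the chi-square distribution is a gamma distribution can be checked numerically. The Python sketch below assumes the usual three-parameter gamma density $p(x) = \frac{1}{\sigma\Gamma(\alpha)}\left(\frac{x-\theta}{\sigma}\right)^{\alpha-1}\exp\left(-\frac{x-\theta}{\sigma}\right)$ for $x > \theta$ and the usual chi-square density with $\nu$ degrees of freedom, and it takes the correspondence to be $\alpha = \nu/2$, $\sigma = 2$, $\theta = 0$. The function names are hypothetical and the densities omit any histogram scaling.

```python
import math


def gamma_pdf(x, alpha, sigma, theta=0.0):
    """Three-parameter gamma density (shape alpha, scale sigma, threshold theta)."""
    if x <= theta:
        return 0.0
    z = (x - theta) / sigma
    return z ** (alpha - 1) * math.exp(-z) / (sigma * math.gamma(alpha))


def chi_square_pdf(x, nu):
    """Chi-square density with nu degrees of freedom."""
    if x <= 0:
        return 0.0
    return x ** (nu / 2 - 1) * math.exp(-x / 2) / (2 ** (nu / 2) * math.gamma(nu / 2))


# Assumed correspondence: alpha = nu / 2, sigma = 2, theta = 0.
difference = abs(gamma_pdf(3.0, alpha=2.5, sigma=2.0) - chi_square_pdf(3.0, nu=5))
```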

Lognormal Distribution

The fitted density function is

[IMAGE]

where

[IMAGE]

The threshold parameter $\theta$ must be less than the minimum data value. You can specify $\theta$ with the THRESHOLD= lognormal-option. By default, $\theta = 0$. If you specify THETA=EST, a maximum likelihood estimate is computed for $\theta$. You can specify $\zeta$ and $\sigma$ with the SCALE= and SHAPE= lognormal-options, respectively. By default, the procedure calculates maximum likelihood estimates for these parameters.

Note:   $\sigma$ denotes the shape parameter of the lognormal distribution, whereas $\sigma$ denotes the scale parameter of the beta, exponential, gamma, normal, and Weibull distributions. The use of $\sigma$ to denote the lognormal shape parameter is based on the fact that $\frac{1}{\sigma}\left(\log(X - \theta) - \zeta\right)$ has a standard normal distribution if $X$ is lognormally distributed.

Normal Distribution

The fitted density function is

[IMAGE]

where

[IMAGE]

You can specify $\mu$ and $\sigma$ with the MU= and SIGMA= normal-options, respectively. By default, the procedure estimates $\mu$ with the sample mean and $\sigma$ with the sample standard deviation. You can use the DATA step function PROBIT to compute normal quantiles and the DATA step function PROBNORM to compute probabilities.

Weibull Distribution

The fitted density function is

[IMAGE]

where

[IMAGE]

The threshold parameter $\theta$ must be less than the minimum data value. You can specify $\theta$ with the THRESHOLD= Weibull-option. By default, $\theta = 0$. If you specify THETA=EST, a maximum likelihood estimate is computed for $\theta$. You can specify $\sigma$ and $c$ with the SCALE= and SHAPE= Weibull-options, respectively. By default, the procedure calculates maximum likelihood estimates for $\sigma$ and $c$.

The exponential distribution is a special case of the Weibull distribution where $c = 1$.

Kernel Density Estimates

You can use the KERNEL option to superimpose kernel density estimates on histograms. Smoothing the data distribution with a kernel density estimate can be more effective than using a histogram to visualize features that might be obscured by the choice of histogram bins or sampling variation. For example, a kernel density estimate can be more effective when the data distribution is multimodal. The general form of the kernel density estimator is

$$\hat{f}_\lambda(x) = \frac{1}{n\lambda}\sum_{i=1}^{n} K_0\!\left(\frac{x - x_i}{\lambda}\right)$$

where $K_0(\cdot)$ is a kernel function, $\lambda$ is the bandwidth, $n$ is the sample size, and $x_i$ is the $i$th observation.

The KERNEL option provides three kernel functions $K_0$: normal, quadratic, and triangular. You can specify the function with the K=kernel-function in parentheses after the KERNEL option. Values for the K= option are NORMAL, QUADRATIC, and TRIANGULAR (with aliases of N, Q, and T, respectively). By default, a normal kernel is used. The formulas for the kernel functions are
Normal $K_0(t) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}t^2\right)$ for $-\infty < t < \infty$
Quadratic $K_0(t) = \frac{3}{4}\left(1 - t^2\right)$ for $|t| \le 1$
Triangular $K_0(t) = 1 - |t|$ for $|t| \le 1$
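The estimator and the three kernels can be sketched in Python as follows. The sketch assumes the general form $\hat{f}_\lambda(x) = \frac{1}{n\lambda}\sum_i K_0((x - x_i)/\lambda)$ and the usual normal, quadratic ($\frac{3}{4}(1-t^2)$ on $|t| \le 1$), and triangular ($1-|t|$ on $|t| \le 1$) kernels. The names `KERNELS` and `kernel_density` are hypothetical, the bandwidth is passed in directly rather than derived from a standardized bandwidth, and no histogram scaling is applied.

```python
import math

# Kernel functions K0(t); the quadratic and triangular kernels vanish outside |t| <= 1.
KERNELS = {
    "normal": lambda t: math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi),
    "quadratic": lambda t: 0.75 * (1 - t * t) if abs(t) <= 1 else 0.0,
    "triangular": lambda t: 1 - abs(t) if abs(t) <= 1 else 0.0,
}


def kernel_density(x, data, bandwidth, kernel="normal"):
    """Evaluate the kernel density estimate at x for the given bandwidth."""
    k0 = KERNELS[kernel]
    n = len(data)
    return sum(k0((x - xi) / bandwidth) for xi in data) / (n * bandwidth)


# Hypothetical data: only the observation at 1.0 contributes at x = 1.0.
density = kernel_density(1.0, [0.0, 1.0, 2.0], bandwidth=1.0, kernel="triangular")
```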

The value of [IMAGE], referred to as the bandwidth parameter, determines the degree of smoothness in the estimated density function. You specify [IMAGE] indirectly by specifying a standardized bandwidth [IMAGE] with the C=kernel-option. If [IMAGE] is the interquartile range, and [IMAGE] is the sample size, then [IMAGE] is related to [IMAGE] by the formula

[IMAGE]
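
The following sketch shows how a standardized bandwidth translates into an actual bandwidth. It assumes that the relationship is lambda = c * Q * n^(-1/5), where c is the standardized bandwidth, Q is the interquartile range, and n is the sample size; the formula itself is not reproduced in this section, so treat this form as an assumption. The interquartile range is passed in directly because its value depends on the percentile definition that is in effect.

```python
def bandwidth(c, q, n):
    """Bandwidth from standardized bandwidth c, interquartile range q,
    and sample size n, assuming lambda = c * q * n**(-1/5)."""
    if c <= 0 or q <= 0 or n <= 0:
        raise ValueError("c, q, and n must all be positive")
    return c * q * n ** (-0.2)

lam_small = bandwidth(0.5, 2.0, 32)
lam_large = bandwidth(1.0, 2.0, 32)
print(lam_small, lam_large)
```

Doubling the standardized bandwidth doubles the bandwidth, while increasing the sample size shrinks it slowly, at the rate of the fifth root of n.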

For a specific kernel function, the discrepancy between the density estimator [IMAGE] and the true density [IMAGE] is measured by the mean integrated square error (MISE):

[IMAGE]

The MISE is the sum of the integrated squared bias and the integrated variance. An approximate mean integrated square error (AMISE) is

[IMAGE]

A bandwidth that minimizes AMISE can be derived by treating [IMAGE] as the normal density having parameters [IMAGE] and [IMAGE] estimated by the sample mean and standard deviation. If you do not specify a bandwidth parameter or if you specify C=MISE, the bandwidth that minimizes AMISE is used. The value of AMISE can be used to compare different density estimates. For each estimate, the bandwidth parameter [IMAGE], the kernel function type, and the value of AMISE are reported in the SAS log.
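
To illustrate the idea of a bandwidth that minimizes AMISE, the following sketch works through the normal-kernel case under a normal reference density. It assumes the usual asymptotic form of AMISE: one fourth of lambda^4 times the squared second moment of the kernel times the integral of the squared second derivative of the density, plus the integral of the squared kernel divided by n times lambda. Under those assumptions the minimizing bandwidth has the closed form sigma * (4 / (3n))^(1/5). This is a sketch of the technique, not a description of the exact computation or reported values in PROC UNIVARIATE, and all names are illustrative.

```python
import math

SQRT_PI = math.sqrt(math.pi)

def amise_normal(lam, n, sigma):
    """AMISE for a normal kernel when the true density is normal with
    standard deviation sigma (assumed asymptotic form)."""
    roughness = 3.0 / (8.0 * SQRT_PI * sigma ** 5)  # integral of f''(x)^2
    kernel_sq = 1.0 / (2.0 * SQRT_PI)               # integral of K(t)^2
    second_moment = 1.0                             # integral of t^2 K(t)
    return 0.25 * lam ** 4 * second_moment ** 2 * roughness + kernel_sq / (n * lam)

def amise_bandwidth(n, sigma):
    """Closed-form minimizer of amise_normal with respect to lam."""
    return sigma * (4.0 / (3.0 * n)) ** 0.2

n, sigma = 100, 1.0
lam_star = amise_bandwidth(n, sigma)
print(lam_star, amise_normal(lam_star, n, sigma))
```

Evaluating amise_normal slightly above and below lam_star gives larger values, which is a simple way to confirm that the closed form is the minimizer.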


Theoretical Distributions for Quantile-Quantile and Probability Plots
You can use the PROBPLOT and QQPLOT statements to request probability and Q-Q plots that are based on the theoretical distributions that are summarized in the following table:

Distributions and Parameters

                                                               Parameters
Distribution              Density Function [IMAGE]   Range     Location          Scale     Shape
Beta                      [IMAGE]                    [IMAGE]   [IMAGE]           [IMAGE]   [IMAGE]
Exponential               [IMAGE]                    [IMAGE]   [IMAGE]           [IMAGE]
Gamma                     [IMAGE]                    [IMAGE]   [IMAGE]           [IMAGE]   [IMAGE]
Lognormal (3-parameter)   [IMAGE]                    [IMAGE]   [IMAGE]           [IMAGE]   [IMAGE]
Normal                    [IMAGE]                    [IMAGE]   [IMAGE]           [IMAGE]
Weibull (3-parameter)     [IMAGE]                    [IMAGE]   [IMAGE]           [IMAGE]   [IMAGE]
Weibull2 (2-parameter)    [IMAGE]                    [IMAGE]   [IMAGE] (known)   [IMAGE]   [IMAGE]

You can request these distributions with the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, NORMAL, WEIBULL, and WEIBULL2 options, respectively. If you omit a distribution option, the PROBPLOT statement creates a normal probability plot and the QQPLOT statement creates a normal Q-Q plot.

The following sections provide the details for constructing Q-Q plots that are based on these distributions. Probability plots are constructed similarly except that the horizontal axis is scaled in percentile units.

Beta Distribution

To create a plot that is based on the beta distribution, PROC UNIVARIATE orders the observations from smallest to largest, and plots the [IMAGE] ordered observation against the quantile [IMAGE] where [IMAGE] is the inverse normalized incomplete beta function, [IMAGE] is the number of nonmissing observations, and [IMAGE] and [IMAGE] are the shape parameters of the beta distribution.

The point pattern on the plot for ALPHA= [IMAGE] and BETA= [IMAGE] tends to be linear with intercept [IMAGE] and slope [IMAGE] if the data are beta distributed with the specific density function

[IMAGE]

where [IMAGE], and [IMAGE] is the lower threshold parameter, [IMAGE] is the scale parameter [IMAGE], [IMAGE] is the first shape parameter [IMAGE], and [IMAGE] is the second shape parameter [IMAGE].

Exponential Distribution

To create a plot that is based on the exponential distribution, PROC UNIVARIATE orders the observations from smallest to largest, and plots the [IMAGE] ordered observation against the quantile [IMAGE] where [IMAGE] is the number of nonmissing observations.

The point pattern on the plot tends to be linear with intercept [IMAGE] and slope [IMAGE] if the data are exponentially distributed with the specific density function

[IMAGE]

where [IMAGE] is a threshold parameter and [IMAGE] is a positive scale parameter.
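
The construction can be sketched as follows. The sketch assumes that the quantile for the i-th ordered observation is -log(1 - (i - 0.375) / (n + 0.25)); this plotting position is an assumption here because the formula is not reproduced in this section. The least squares line is only a numerical stand-in for reading the intercept and slope off the plot, and the function names are illustrative.

```python
import math

def exponential_qq_points(data):
    """Return (quantile, ordered observation) pairs for an exponential Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    # Assumed plotting position: (i - 0.375) / (n + 0.25)
    return [(-math.log(1.0 - (i - 0.375) / (n + 0.25)), x)
            for i, x in enumerate(xs, start=1)]

def fit_line(points):
    """Ordinary least squares fit of observation on quantile; returns (intercept, slope)."""
    n = len(points)
    mean_q = sum(q for q, _ in points) / n
    mean_x = sum(x for _, x in points) / n
    sqq = sum((q - mean_q) ** 2 for q, _ in points)
    sqx = sum((q - mean_q) * (x - mean_x) for q, x in points)
    slope = sqx / sqq
    return mean_x - slope * mean_q, slope

# Synthetic data placed exactly on the line theta + sigma * quantile.
theta, sigma, n = 3.0, 2.0, 20
data = [theta - sigma * math.log(1.0 - (i - 0.375) / (n + 0.25))
        for i in range(1, n + 1)]
intercept, slope = fit_line(exponential_qq_points(data))
print(round(intercept, 6), round(slope, 6))
```

With data constructed this way, the fitted intercept and slope recover the threshold and scale that generated the data. Real data scatter around the line instead of falling exactly on it.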

Gamma Distribution

To create a plot that is based on the gamma distribution, PROC UNIVARIATE orders the observations from smallest to largest, and plots the [IMAGE] ordered observation against the quantile [IMAGE] where [IMAGE] is the inverse normalized incomplete gamma function, [IMAGE] is the number of nonmissing observations, and [IMAGE] is the shape parameter of the gamma distribution.

The point pattern on the plot tends to be linear with intercept [IMAGE] and slope [IMAGE] if the data are gamma distributed with the specific density function

[IMAGE]

where [IMAGE] is the threshold parameter, [IMAGE] is the scale parameter [IMAGE], and [IMAGE] is the shape parameter [IMAGE].

Lognormal Distribution

To create a plot that is based on the lognormal distribution, PROC UNIVARIATE orders the observations from smallest to largest, and plots the [IMAGE] ordered observation against the quantile [IMAGE] where [IMAGE] is the inverse standard cumulative normal distribution, [IMAGE] is the number of nonmissing observations, and [IMAGE] is the shape parameter of the lognormal distribution.

The point pattern on the plot for SIGMA= [IMAGE] tends to be linear with intercept [IMAGE] and slope [IMAGE] if the data are lognormally distributed with the specific density function

[IMAGE]

where [IMAGE] is the threshold parameter, [IMAGE] is the scale parameter, and [IMAGE] is the shape parameter [IMAGE].

Normal Distribution

To create a plot that is based on the normal distribution, PROC UNIVARIATE orders the observations from smallest to largest, and plots the [IMAGE] ordered observation against the quantile [IMAGE] where [IMAGE] is the inverse cumulative standard normal distribution and [IMAGE] is the number of nonmissing observations.

The point pattern on the plot tends to be linear with intercept [IMAGE] and slope [IMAGE] if the data are normally distributed with the specific density function

[IMAGE]

where [IMAGE] is the mean and [IMAGE] is the standard deviation ([IMAGE]).
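
A minimal sketch of the plot coordinates for the normal case follows. It assumes that the quantile for the i-th ordered observation is the inverse standard normal distribution function evaluated at (i - 0.375) / (n + 0.25); the plotting position is an assumption because the formula is not reproduced in this section. The function name and the sample values are illustrative.

```python
from statistics import NormalDist

def normal_qq_points(data):
    """Return (quantile, ordered observation) pairs for a normal Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting position: (i - 0.375) / (n + 0.25)
    return [(std_normal.inv_cdf((i - 0.375) / (n + 0.25)), x)
            for i, x in enumerate(xs, start=1)]

points = normal_qq_points([4.1, 5.3, 2.2, 6.8, 3.9])
quantiles = [q for q, _ in points]
ordered = [x for _, x in points]
print(points)
```

The quantiles are symmetric about zero, so for an odd number of observations the middle ordered observation is plotted against a quantile of zero.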

Three-Parameter Weibull Distribution

To create a plot that is based on a three-parameter Weibull distribution, PROC UNIVARIATE orders the observations from smallest to largest, and plots the [IMAGE] ordered observation against the quantile [IMAGE] where [IMAGE] is the number of nonmissing observations, and [IMAGE] and [IMAGE] are the Weibull distribution shape parameters.

The point pattern on the plot for C= [IMAGE] tends to be linear with intercept [IMAGE] and slope [IMAGE] if the data are Weibull distributed with the specific density function

[IMAGE]

where [IMAGE] is the threshold parameter, [IMAGE] is the scale parameter [IMAGE], and [IMAGE] is the shape parameter [IMAGE].

Two-Parameter Weibull Distribution

To create a plot that is based on a two-parameter Weibull distribution, PROC UNIVARIATE orders the observations from smallest to largest, and plots the log of the shifted [IMAGE] ordered observation [IMAGE], denoted by [IMAGE], against the quantile [IMAGE] where [IMAGE] is the number of nonmissing observations.

Unlike the three-parameter Weibull quantile, the preceding expression is free of distribution parameters. This is why the C= shape parameter is not required in the WEIBULL2 option.

The point pattern on the plot for THETA= [IMAGE] tends to be linear with intercept [IMAGE] and slope [IMAGE] if the data are Weibull distributed with the specific density function

[IMAGE]

where [IMAGE] is the known lower threshold, [IMAGE] is the scale parameter [IMAGE], and [IMAGE] is the shape parameter [IMAGE].
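
The two-parameter construction can be sketched as follows. The sketch assumes that the quantile for the i-th ordered observation is log(-log(1 - (i - 0.375) / (n + 0.25))) and that, for Weibull data, the log of the shifted observation is linear in this quantile with intercept log(sigma) and slope 1/c. Both are assumptions here because the formulas are not reproduced in this section, although the linear relationship follows from inverting the standard Weibull distribution function. The least squares fit is a numerical stand-in for reading the line off the plot, and all names are illustrative.

```python
import math

def weibull2_qq_points(data, theta0=0.0):
    """Return (quantile, log shifted observation) pairs for a two-parameter Weibull Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    # Assumed plotting position: (i - 0.375) / (n + 0.25)
    return [(math.log(-math.log(1.0 - (i - 0.375) / (n + 0.25))),
             math.log(x - theta0))
            for i, x in enumerate(xs, start=1)]

def fit_line(points):
    """Ordinary least squares fit; returns (intercept, slope)."""
    n = len(points)
    mean_q = sum(q for q, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sqq = sum((q - mean_q) ** 2 for q, _ in points)
    sqy = sum((q - mean_q) * (y - mean_y) for q, y in points)
    slope = sqy / sqq
    return mean_y - slope * mean_q, slope

# Synthetic data placed exactly on Weibull quantiles with known threshold 1.
theta0, sigma, c, n = 1.0, 3.0, 2.0, 25
data = [theta0 + sigma * (-math.log(1.0 - (i - 0.375) / (n + 0.25))) ** (1.0 / c)
        for i in range(1, n + 1)]
intercept, slope = fit_line(weibull2_qq_points(data, theta0))
c_hat = 1.0 / slope               # shape estimate from the slope
sigma_hat = math.exp(intercept)   # scale estimate from the intercept
print(round(c_hat, 6), round(sigma_hat, 6))
```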

Shape Parameters

Some distribution options in the PROBPLOT and QQPLOT statements require that you specify one or two shape parameters in parentheses after the distribution keyword. These are summarized in the following table:

Shape Parameter Options

Distribution Keyword   Required Shape Parameter Option   Range
BETA                   ALPHA= [IMAGE], BETA= [IMAGE]     [IMAGE]
EXPONENTIAL            None
GAMMA                  ALPHA= [IMAGE]                    [IMAGE]
LOGNORMAL              SIGMA= [IMAGE]                    [IMAGE]
NORMAL                 None
WEIBULL                C= [IMAGE]                        [IMAGE]
WEIBULL2               None

You can visually estimate the value of a shape parameter by specifying a list of values for the shape parameter option. PROC UNIVARIATE produces a separate plot for each value. Then you can use the value of the shape parameter that produces the most nearly linear point pattern. Alternatively, you can request that PROC UNIVARIATE use an estimated shape parameter to create the plot.
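
The comparison across a list of shape values can be mimicked numerically. The following sketch builds lognormal Q-Q coordinates for several candidate values of the shape parameter and scores each by the correlation between quantiles and ordered observations, on the reasoning that the most nearly linear pattern has the correlation closest to 1. The correlation score is a stand-in for visual judgment and is not something that PROC UNIVARIATE computes here. The quantile form exp(sigma * z), where z is the standard normal quantile at (i - 0.375) / (n + 0.25), is an assumption because the formula is not reproduced in this section, and all names are illustrative.

```python
import math
from statistics import NormalDist

def lognormal_qq_points(data, sigma):
    """Return (quantile, ordered observation) pairs for a lognormal Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    # Assumed plotting position: (i - 0.375) / (n + 0.25)
    return [(math.exp(sigma * std_normal.inv_cdf((i - 0.375) / (n + 0.25))), x)
            for i, x in enumerate(xs, start=1)]

def correlation(points):
    """Pearson correlation between the two coordinates of the points."""
    n = len(points)
    mean_q = sum(q for q, _ in points) / n
    mean_x = sum(x for _, x in points) / n
    sqq = sum((q - mean_q) ** 2 for q, _ in points)
    sxx = sum((x - mean_x) ** 2 for _, x in points)
    sqx = sum((q - mean_q) * (x - mean_x) for q, x in points)
    return sqx / math.sqrt(sqq * sxx)

# Synthetic data placed exactly on lognormal quantiles with shape 0.5.
n, true_sigma = 30, 0.5
data = [1.0 + 2.0 * math.exp(true_sigma * NormalDist().inv_cdf((i - 0.375) / (n + 0.25)))
        for i in range(1, n + 1)]

candidates = [0.25, 0.5, 1.0, 2.0]
scores = {s: correlation(lognormal_qq_points(data, s)) for s in candidates}
best = max(scores, key=scores.get)
print(best, scores)
```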

Note:   For Q-Q plots that are requested with the WEIBULL2 option, you can estimate the shape parameter [IMAGE] from a linear pattern by using the fact that the slope of the pattern is [IMAGE].

Location and Scale Parameters

When you use the PROBPLOT statement to specify or estimate the location and scale parameters for a distribution, a diagonal distribution reference line appears on the probability plot. (An exception is the two-parameter Weibull distribution, where the line appears when you specify or estimate the scale and shape parameters.) Agreement between this line and the point pattern indicates that the distribution with these parameters is a good fit.

Note:   Close visual agreement does not necessarily mean that the distribution is a good fit based on the criteria that are used by formal goodness-of-fit tests.

When the point pattern on a Q-Q plot is linear, its intercept and slope provide estimates of the location and scale parameters. (An exception to this rule is the two-parameter Weibull distribution, for which the intercept and slope are related to the scale and shape parameters.) When you use the QQPLOT statement to specify or estimate the slope and intercept of the line, a diagonal distribution reference line appears on the Q-Q plot. This line allows you to check the linearity of the point pattern.

The following table shows which parameters to specify to determine the intercept and slope of the line:

Intercept and Slope of Distribution Reference Line

                         Parameters                            Linear Pattern
Distribution             Location          Scale     Shape     Intercept   Slope
BETA                     [IMAGE]           [IMAGE]   [IMAGE]   [IMAGE]     [IMAGE]
EXPONENTIAL              [IMAGE]           [IMAGE]             [IMAGE]     [IMAGE]
GAMMA                    [IMAGE]           [IMAGE]   [IMAGE]   [IMAGE]     [IMAGE]
LOGNORMAL                [IMAGE]           [IMAGE]   [IMAGE]   [IMAGE]     [IMAGE]
NORMAL                   [IMAGE]           [IMAGE]             [IMAGE]     [IMAGE]
WEIBULL (3-parameter)    [IMAGE]           [IMAGE]   [IMAGE]   [IMAGE]     [IMAGE]
WEIBULL2 (2-parameter)   [IMAGE] (known)   [IMAGE]   [IMAGE]   [IMAGE]     [IMAGE]

For the LOGNORMAL and WEIBULL2 options, you can specify the slope directly with the SLOPE= option. That is, for the LOGNORMAL option, specifying THETA= [IMAGE] and SLOPE= [IMAGE] displays the same line as specifying THETA= [IMAGE] and ZETA= [IMAGE]. For the WEIBULL2 option, specifying SIGMA= [IMAGE] and SLOPE= [IMAGE] displays the same line as specifying SIGMA= [IMAGE] and C= [IMAGE]. Alternatively, you can request that PROC UNIVARIATE use the estimated values of the parameters to determine the reference line.
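
The equivalence between the slope and the distribution parameters can be written out as a small sketch. It assumes the relationships that follow from the standard density forms: a lognormal observation equals theta + exp(zeta) * exp(sigma * z), so the Q-Q line has intercept theta and slope exp(zeta); for the two-parameter Weibull distribution, the log of the shifted observation is linear in the quantile with intercept log(sigma) and slope 1/c. The values in this section's tables are not reproduced, so treat these relationships as assumptions. The function names are illustrative.

```python
import math

def lognormal_reference_line(theta, zeta):
    """Return (intercept, slope) of the lognormal Q-Q reference line (assumed form)."""
    return theta, math.exp(zeta)

def weibull2_reference_line(sigma, c):
    """Return (intercept, slope) of the two-parameter Weibull Q-Q reference line (assumed form)."""
    return math.log(sigma), 1.0 / c

ln_line = lognormal_reference_line(theta=1.0, zeta=0.0)
wb_line = weibull2_reference_line(sigma=1.0, c=2.0)
print(ln_line, wb_line)
```

Under these assumptions, a scale parameter of 0 for the lognormal distribution corresponds to a slope of 1, and a shape parameter of 2 for the two-parameter Weibull distribution corresponds to a slope of 0.5.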


Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.