
The FREQ Procedure

In addition to computation of exact *p*-values, PROC FREQ
provides the option of estimating exact *p*-values by Monte
Carlo simulation. This can be useful for
problems that are so large that exact computations require
a great amount of time and memory, but for which asymptotic
approximations may not be sufficient.

PROC FREQ provides exact *p*-values for the following tests
for two-way tables:
Pearson chi-square, likelihood-ratio chi-square, Mantel-Haenszel
chi-square, Fisher's exact test,
Jonckheere-Terpstra test, Cochran-Armitage test for trend, and
McNemar's test. PROC FREQ also computes exact *p*-values
for tests of hypotheses that the following statistics equal zero:
Pearson correlation coefficient, Spearman correlation coefficient,
simple kappa coefficient, and weighted kappa coefficient.
Additionally, PROC FREQ computes exact confidence limits for
the odds ratio for 2 × 2 tables.
For one-way frequency tables, PROC FREQ provides the exact
chi-square goodness-of-fit test (for equal proportions or for
proportions or frequencies that you specify). Also for one-way
tables, PROC FREQ provides exact confidence limits for the
binomial proportion and an exact test for the binomial proportion
value.

The following sections summarize the exact computational
algorithms, define the exact *p*-values that PROC FREQ
computes, discuss the computational resource
requirements, and describe the Monte Carlo estimation option.

The reference set for a given contingency table is the set of all contingency tables with the observed marginal row and column sums. Corresponding to this reference set, the network algorithm forms a directed acyclic network consisting of nodes in a number of stages. A path through the network corresponds to a distinct table in the reference set. The distances between nodes are defined so that the total distance of a path through the network is the corresponding value of the test statistic. At each node, the algorithm computes the shortest and longest path distances for all the paths that pass through that node. For statistics that can be expressed as a linear combination of cell frequencies multiplied by increasing row and column scores, PROC FREQ computes shortest and longest path distances using the algorithm given in Agresti, Mehta, and Patel (1990). For statistics of other forms, PROC FREQ computes an upper bound for the longest path and a lower bound for the shortest path, following the approach of Valz and Thompson (1994).

The longest and shortest path distances or bounds for a node
are compared to the value of the test statistic to determine
whether all paths through the node contribute to the *p*-value,
none of the paths through the node contribute to the *p*-value,
or neither of these situations occurs. If all paths through the node
contribute, the *p*-value is incremented accordingly, and these
paths are eliminated from further analysis. If no paths contribute, these
paths are eliminated from the analysis. Otherwise, the algorithm
continues, still processing this node and the associated paths. The
algorithm finishes when every node has either been accounted for, with
the *p*-value incremented accordingly, or been eliminated.

In applying the network algorithm, PROC FREQ uses full
precision to represent all statistics, row and column scores, and other
quantities involved in the computations. Although it is possible to
use rounding to improve the speed and memory requirements of the
algorithm, PROC FREQ does not do this because rounding can reduce the
accuracy of the *p*-values.
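For intuition about the quantity that the network algorithm computes, the following Python sketch obtains an exact *p*-value by brute-force enumeration of the reference set. It is not the network algorithm and does no pruning, so it is practical only for very small tables. The function names, the choice of the Pearson chi-square as the test statistic, and the small tolerance `tol` used to compare floating-point statistics are assumptions made for illustration; they are not part of PROC FREQ.

```python
from math import factorial, prod

def tables_with_margins(row_sums, col_sums):
    """Yield every table of nonnegative integers with the given margins."""
    def fill_row(total, caps):
        # Every way to place `total` counts into cells capped by `caps`.
        if len(caps) == 1:
            if total <= caps[0]:
                yield (total,)
            return
        for x in range(min(total, caps[0]) + 1):
            for rest in fill_row(total - x, caps[1:]):
                yield (x,) + rest

    def build(r, remaining):
        if r == len(row_sums) - 1:
            # The last row is forced by the remaining column totals.
            yield (tuple(remaining),)
            return
        for row in fill_row(row_sums[r], remaining):
            left = tuple(c - x for c, x in zip(remaining, row))
            for tail in build(r + 1, left):
                yield (row,) + tail

    yield from build(0, tuple(col_sums))

def table_probability(table):
    """Hypergeometric probability of a table, conditional on its margins."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    num = prod(factorial(v) for v in rows + cols)
    den = factorial(sum(rows)) * prod(factorial(x) for r in table for x in r)
    return num / den

def pearson_chi_square(table):
    """Pearson chi-square statistic (assumes all margins are nonzero)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((x - ri * cj / n) ** 2 / (ri * cj / n)
               for r, ri in zip(table, rows) for x, cj in zip(r, cols))

def exact_p_value(observed, statistic=pearson_chi_square, tol=1e-9):
    """Sum the probabilities of all tables at least as extreme as observed."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    t_obs = statistic(observed)
    return sum(table_probability(t)
               for t in tables_with_margins(rows, cols)
               if statistic(t) >= t_obs - tol)
```

For the 2 × 2 table `[[3, 1], [1, 3]]`, the reference set contains only five tables, so the enumeration is immediate. The number of tables grows very quickly with the sample size and table dimensions, which is why an algorithm that eliminates whole groups of paths at once matters in practice.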

PROC FREQ computes exact confidence limits for the odds ratio according to an iterative algorithm based on that presented by Thomas (1971). Refer also to Gart (1971). Because this is a discrete problem, the confidence coefficient is not exactly 1 − α, but it is at least 1 − α. Thus, these confidence limits are conservative.

For one-way tables, PROC FREQ computes the exact chi-square
goodness-of-fit test by the method of Radlow and Alf (1975).
PROC FREQ generates all possible one-way tables with the
observed total sample size and number of categories. For
each possible table, PROC FREQ compares its chi-square
value with the value for the observed table. If the table's
chi-square value is greater than or equal to the observed
chi-square, PROC
FREQ increments the exact *p*-value by the probability
of that table, which is calculated under the null hypothesis using
the multinomial frequency distribution. By default, the null
hypothesis states that all categories have equal proportions.
If you specify null hypothesis proportions or frequencies using
the TESTP= or TESTF= option in the TABLES statement, then
PROC FREQ calculates the exact chi-square test based on that
null hypothesis.
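The enumeration described above can be sketched in Python as follows. This is a minimal illustration rather than PROC FREQ's implementation: the function names are hypothetical, the `probs` argument stands in for null proportions such as those given with TESTP=, and the tolerance `tol` is an assumption added so that floating-point ties with the observed chi-square are counted.

```python
from math import factorial, prod

def compositions(n, k):
    """Yield every k-tuple of nonnegative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_probability(counts, probs):
    """Probability of `counts` under a multinomial with cell probabilities `probs`."""
    coef = factorial(sum(counts)) // prod(factorial(c) for c in counts)
    return coef * prod(p ** c for p, c in zip(probs, counts))

def chi_square(counts, probs):
    """Chi-square goodness-of-fit statistic against the null proportions."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def exact_gof_p_value(observed, probs=None, tol=1e-9):
    """Exact p-value: total probability of tables with chi-square >= observed."""
    k = len(observed)
    if probs is None:
        probs = [1.0 / k] * k          # equal proportions by default
    t_obs = chi_square(observed, probs)
    return sum(multinomial_probability(c, probs)
               for c in compositions(sum(observed), k)
               if chi_square(c, probs) >= t_obs - tol)
```

With counts `[3, 1]` and equal null proportions, four of the five possible tables have a chi-square at least as large as the observed value, and their probabilities sum to 10/16.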

For binomial proportions in one-way tables, PROC FREQ computes
exact confidence limits using the *F* distribution method given
in Collett (1991) and also described by Leemis and Trivedi (1996).
PROC FREQ computes the exact test for a binomial proportion
(*H*0: *p* = *p _{0}*) by summing binomial probabilities over all
alternatives. See the section "Binomial Proportion" for details.
By default, PROC FREQ uses

There are other tests where it may be appropriate to test against
either a one-sided or a two-sided alternative hypothesis. For
example, when you test the null hypothesis that the true parameter
value equals 0 (*T* = 0), the alternative of interest
may be one-sided (*T* < 0, or *T* > 0) or two-sided (*T* ≠ 0).
Such tests include the Pearson correlation coefficient, Spearman
correlation coefficient, Jonckheere-Terpstra test, Cochran-Armitage
test for trend, simple kappa coefficient, and weighted kappa
coefficient. For these tests, PROC FREQ outputs the right-sided
*p*-value when the observed value of the test statistic is
greater than its expected value. The right-sided *p*-value
is the sum of probabilities for those tables having a test
statistic greater than or equal to the observed test statistic. Otherwise,
when the test statistic is less than or equal to its expected value, PROC
FREQ outputs the left-sided *p*-value. The left-sided *p*-value
is the sum of
probabilities for those tables having a test statistic less than
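or equal to the one observed. The one-sided *p*-value
*P _{1}* can be expressed as

$$
P_1 =
\begin{cases}
\Pr(\text{Test Statistic} \ge t) & \text{if } t > E_0(T) \\
\Pr(\text{Test Statistic} \le t) & \text{if } t \le E_0(T)
\end{cases}
$$

where *t* is the observed value of the test statistic and *E _{0}*(*T*) is the expected value of the test statistic under the null hypothesis.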
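The rule for choosing between the right-sided and left-sided *p*-value can be illustrated with a short Python sketch. It assumes the exact null distribution of the test statistic is already available as a mapping from each possible statistic value to its probability; the function name and the example distribution are hypothetical.

```python
from fractions import Fraction

def one_sided_p_value(distribution, t):
    """One-sided p-value for observed statistic t.

    `distribution` maps each possible value of the test statistic to its
    probability under the null hypothesis.
    """
    expected = sum(v * p for v, p in distribution.items())
    if t > expected:
        # Right-sided: tables with a statistic >= the observed value.
        return sum(p for v, p in distribution.items() if v >= t)
    # Left-sided: tables with a statistic <= the observed value.
    return sum(p for v, p in distribution.items() if v <= t)

# Illustrative null distribution: the first cell of a 2 x 2 table whose
# row and column totals all equal 4 (weights 1, 16, 36, 16, 1 out of 70).
null_dist = {a: Fraction(w, 70) for a, w in zip(range(5), (1, 16, 36, 16, 1))}
right = one_sided_p_value(null_dist, 3)   # observed value above the mean of 2
left = one_sided_p_value(null_dist, 1)    # observed value below the mean of 2
```

Exact fractions are used here only to keep the illustration free of rounding; any numeric probabilities work the same way.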

No formula exists that can
predict in advance how much time and memory
are needed to compute an exact *p*-value for a
given problem. The time and memory required depend on
several factors, including which test is being performed,
the total sample size, the number of rows and columns,
and the specific arrangement of the observations into
table cells. Generally, larger problems
(in terms of total sample size, number of rows, and number of
columns) tend to require more time and memory. Additionally,
for a fixed total sample size, time and memory requirements tend
to increase as the number of rows and columns increases,
since this corresponds to an increase in the number of
tables in the reference set. Also for a fixed sample size,
time and memory requirements increase as the marginal
row and column totals become more homogeneous. Refer to
Agresti, Mehta, and Patel (1990) and Gail and Mantel (1977).

At any time while PROC FREQ is computing exact *p*-values,
you can terminate the computations by pressing the system
interrupt key sequence (refer to the *SAS Companion* for your system)
and choosing to stop computations. After you terminate exact
computations, PROC FREQ completes all other remaining tasks.
The procedure produces the requested output and reports missing
values for any exact *p*-values that were not computed by the time of
termination.

You can also use the MAXTIME= option in the EXACT statement to
limit the amount of time PROC FREQ uses for exact computations.
You specify a MAXTIME= value that is the maximum amount of
clock time (in seconds) that PROC FREQ can use to compute an
exact *p*-value. If PROC FREQ does not finish computing
an exact *p*-value within that time, it terminates the
computation and completes all other remaining tasks.

To compute a Monte Carlo estimate of an exact *p*-value, PROC
FREQ generates a random sample of tables with the same total
sample size, row totals, and column totals as the observed table.
PROC FREQ uses the algorithm of Agresti, Wackerly, and Boyett (1979),
which generates tables in proportion to their hypergeometric
probabilities conditional on the marginal frequencies.
For each sample table, PROC FREQ computes the value of the test
statistic and compares it to the value for the observed table.
When estimating a right-sided *p*-value, PROC FREQ counts all
sample tables for which the test statistic is greater than or
equal to the observed test statistic. Then the *p*-value
estimate equals the number of these tables divided by the total
number of tables sampled.
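The estimation procedure can be sketched in Python as follows. Note that this sketch does not implement the algorithm of Agresti, Wackerly, and Boyett (1979). It substitutes a simple permutation scheme, shuffling column labels against fixed row labels, which also produces tables with probabilities equal to their hypergeometric probabilities conditional on the margins. The function names, the default sample count, the seed, the Pearson chi-square statistic, and the tolerance `tol` are all assumptions made for illustration.

```python
import random

def pearson_chi_square(table):
    """Pearson chi-square statistic (assumes all margins are nonzero)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((x - ri * cj / n) ** 2 / (ri * cj / n)
               for r, ri in zip(table, rows) for x, cj in zip(r, cols))

def random_table(row_sums, col_sums, rng):
    """Draw one table with the given margins by permuting column labels."""
    col_labels = [j for j, c in enumerate(col_sums) for _ in range(c)]
    rng.shuffle(col_labels)
    table = [[0] * len(col_sums) for _ in row_sums]
    pos = 0
    for i, r in enumerate(row_sums):
        # The next r shuffled observations fall in row i.
        for j in col_labels[pos:pos + r]:
            table[i][j] += 1
        pos += r
    return table

def monte_carlo_p_value(observed, n_samples=10000, seed=12345, tol=1e-9):
    """Estimate a right-sided p-value as (tables >= observed) / (tables sampled)."""
    rng = random.Random(seed)
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    t_obs = pearson_chi_square(observed)
    hits = sum(pearson_chi_square(random_table(rows, cols, rng)) >= t_obs - tol
               for _ in range(n_samples))
    return hits / n_samples
```

For a table as small as `[[3, 1], [1, 3]]` the exact *p*-value (34/70) is trivial to compute directly, which makes it a convenient check that the estimate lands close to the exact value.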

PROC FREQ computes left-sided and two-sided *p*-value estimates
in a similar manner. For left-sided *p*-values, PROC FREQ evaluates
whether the test statistic for each sampled table is less than
or equal to the observed test statistic. For two-sided *p*-values,
PROC FREQ examines the sample test statistics according to the
expression for *P _{2}* given in the
"Asymptotic Tests" section.
The variable

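PROC FREQ constructs asymptotic confidence limits for the
*p*-values according to

$$
\hat{P} \pm z_{\alpha/2} \, \mathrm{se}(\hat{P}),
\qquad
\mathrm{se}(\hat{P}) = \sqrt{\frac{\hat{P}\,(1 - \hat{P})}{N - 1}}
$$

where $\hat{P}$ is the Monte Carlo estimate of the *p*-value, *N* is the number of tables sampled, and $z_{\alpha/2}$ is the $100(1 - \alpha/2)$ percentile of the standard normal distribution.

When the Monte Carlo estimate equals 0, then PROC
FREQ computes the confidence limits for the *p*-value as

$$
\left( 0, \; 1 - \alpha^{1/N} \right)
$$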
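As a rough numerical illustration, the following Python sketch computes confidence limits for a Monte Carlo *p*-value estimate. It assumes normal-approximation limits of the form estimate ± z × se, with se = sqrt(estimate × (1 − estimate) / (N − 1)) for N sampled tables, and limits of (0, 1 − α^(1/N)) when the estimate equals 0. The function name and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def monte_carlo_confidence_limits(p_hat, n_samples, alpha):
    """Return (lower, upper) limits for a Monte Carlo p-value estimate.

    Assumed forms: p_hat +/- z * se with se = sqrt(p_hat*(1-p_hat)/(N-1)),
    and (0, 1 - alpha**(1/N)) when the estimate is exactly 0.
    """
    if p_hat == 0:
        return (0.0, 1 - alpha ** (1 / n_samples))
    z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal percentile
    se = sqrt(p_hat * (1 - p_hat) / (n_samples - 1))
    return (p_hat - z * se, p_hat + z * se)
```

Because the standard error shrinks with the square root of the number of sampled tables, halving the width of the interval requires roughly four times as many samples.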


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.