Chapter Contents

Previous

Next
The SQL Procedure

summary-function


Performs statistical summary calculations.

Restriction: A summary function cannot appear in an ON clause or a WHERE clause.
See also: GROUP BY , HAVING Clause , SELECT Clause , and table-expression
Featured in: Creating a View from a Query's Result , Joining Two Tables and Calculating a New Value , and Counting Missing Values with a SAS Macro


summary-function (<DISTINCT|ALL> sql-expression)


Summarizing Data
Summary functions produce a statistical summary of the entire table or view listed in the FROM clause or for each group specified in a GROUP BY clause. If GROUP BY is omitted, all the rows in the table or view are considered to be a single group. These functions reduce all the values in each row or column in a table to one summarizing or aggregate value. For this reason, these functions are often called aggregate functions. For example, the sum (one value) of a column results from the addition of all the values in the column.


Function Names and the Corresponding Statistics
Some functions have more than one name to accommodate both SAS and SQL conventions:

AVG, MEAN
means or average of values

COUNT, FREQ, N
number of nonmissing values

CSS
corrected sum of squares

CV
coefficient of variation (percent)

MAX
largest value

MIN
smallest value

NMISS
number of missing values

PRT
probability of a greater absolute value of Student's t

RANGE
range of values

STD
standard deviation

STDERR
standard error of the mean

SUM
sum of values

SUMWGT
sum of the WEIGHT variable values(footnote 1)

T
Student's t value for testing the hypothesis that the population mean is zero

USS
uncorrected sum of squares

VAR
variance

For a description and the formulas used for these statistics, see SAS Elementary Statistics Procedures


Counting Rows
The COUNT function counts rows. COUNT(*) returns the total number of rows in a group or in a table. If you use a column name as an argument to COUNT, the result is the total number of rows in a group or in a table that have a nonmissing value for that column. If you want to count the unique values in a column, specify COUNT(DISTINCT column).

If the SELECT clause of a table-expression contains one or more summary functions and that table-expression resolves to no rows, then the summary function results are missing values. The following are exceptions that return zeros:
COUNT(*)
COUNT(<DISTINCT> sql-expression)
NMISS(<DISTINCT> sql-expression)

See Creating a View from a Query's Result and Counting Missing Values with a SAS Macro for examples.


Calculating Statistics Based on the Number of Arguments
The number of arguments specified in a summary function affects how the calculation is performed. If you specify a single argument, the values in the column are calculated. If you specify multiple arguments, the arguments or columns listed are calculated for each row. For example, consider calculations on the following table.

proc sql;
   title 'Summary Table';
   select * from summary;
[HTML Output]  [Listing Output]

If you use one argument in the function, the calculation is performed on that column only. If you use more than one argument, the calculation is performed on each row of the specified columns. In the following PROC SQL step, the MIN and MAX functions return the minimum and maximum of the columns they are used with. The SUM function returns the sum of each row of the columns specified as arguments:

proc sql;
    select min(x) as Colmin_x, 
           min(y) as Colmin_y, 
           max(z) as Colmax_z, 
           sum(x,y,z) as Rowsum
       from summary;
[HTML Output]  [Listing Output]


Remerging Data
When you use a summary function in a SELECT clause or a HAVING clause, you may see the following message in the SAS log:

NOTE: The query requires remerging summary 
      statistics back with the original
      data.

The process of remerging involves two passes through the data. On the first pass, PROC SQL

On the second pass, PROC SQL retrieves any additional columns and rows that it needs to show in the output.

The following examples use the PROCLIB.PAYROLL table (shown in Creating a Table from a Query's Result ) to show when remerging of data is and is not necessary.

The first query requires remerging. The first pass through the data groups the data by Jobcode and resolves the AVG function for each group. However, PROC SQL must make a second pass in order to retrieve the values of IdNumber and Salary.

proc sql outobs=10;
   title 'Salary Information';
   title2 '(First 10 Rows Only)';
   select  IdNumber, Jobcode, Salary, 
           avg(salary) as AvgSalary
      from proclib.payroll
      group by jobcode;
[HTML Output]  [Listing Output] You can change the previous query to return only the average salary for each jobcode. The following query does not require remerging because the first pass of the data does the summarizing and the grouping. A second pass is not necessary.
proc sql outobs=10;
   title 'Average Salary for Each Jobcode';
   select Jobcode, avg(salary) as AvgSalary
   from proclib.payroll
   group by jobcode;
[HTML Output]  [Listing Output] When you use the HAVING clause, PROC SQL may have to remerge data to resolve the HAVING expression.

First, consider a query that uses HAVING but that does not require remerging. The query groups the data by values of Jobcode, and the result contains one row for each value of Jobcode and summary information for people in each Jobcode. On the first pass, the summary functions provide values for the Number, Average Age, and Average Salary columns. The first pass provides everything that PROC SQL needs to resolve the HAVING clause, so no remerging is necessary.

proc sql outobs=10;
title 'Summary Information for Each Jobcode';
title2 '(First 10 Rows Only)';
   select Jobcode,
          count(jobcode) as number 
              label='Number',
          avg(int((today()-birth)/365.25))
              as avgage format=2. 
              label='Average Age',
          avg(salary) as avgsal format=dollar8.
              label='Average Salary'
      from proclib.payroll
      group by jobcode
      having avgage ge 30;
[HTML Output]  [Listing Output] In the following query, PROC SQL remerges the data because the HAVING clause uses the SALARY column in the comparison and SALARY is not in the GROUP BY clause.
proc sql outobs=10;
title 'Employees who Earn More than the';
title2 'Average for Their Jobcode';
title3 '(First 10 Rows Only)';
   select Jobcode, Salary, 
          avg(salary) as AvgSalary
      from proclib.payroll
      group by jobcode
      having salary > AvgSalary;
[HTML Output]  [Listing Output] Keep in mind that PROC SQL remerges data when


Chapter Contents

Previous

Next

Top of Page

Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.