SAS Companion for the OpenVMS Operating Environment

Data Set I/O

The information that is presented in this section applies to reading and writing SAS data sets. In general, the larger your data sets, the greater the potential performance gain for your entire SAS job. The performance gains that are described here were observed on data sets of approximately 100,000 blocks.

Allocating Data Set Space Appropriately

Job type: Jobs that write data sets.
User: SAS programmer.
Usage: Use ALQ=x and DEQ=y (or ALQMULT=x and DEQMULT=y) as LIBNAME statement options or as data set options. For ALQ= and DEQ=, x and y are numbers of blocks; for ALQMULT= and DEQMULT=, x and y are numbers of data set pages.
Benefit: There is up to a 50-percent decrease in elapsed time on write operations, as reflected in fewer direct I/Os. File fragmentation is also reduced, thereby enhancing performance when you read the data set.
Cost: You will experience performance degradation when ALQ= or DEQ= values are incompatible with data set size.

In Version 8, the SAS System initially allocates enough space for 10 pages of data for a data set. Each time the data set is extended, another 5 pages of space is allocated on the disk. OpenVMS maintains a bit map on each disk that identifies the blocks that are available for use. When a data set is written and then extended, OpenVMS alternates between scanning the bit map to locate free blocks and actually writing the data set. However, if the data sets were written with larger initial and extent allocations, then write operations to the data set would proceed uninterrupted for longer periods of time. At the hardware level, this means that disk activity is concentrated on the data set, and disk head seek operations that alternate between the bit map and the data set are minimized. The user sees fewer I/Os and faster elapsed time.

Large initial and extent values can also reduce disk fragmentation. SAS data sets are written using the RMS algorithm "contiguous best try." With large preallocation, the space is reserved to the data set and does not become fragmented as it does when inappropriate ALQ= and DEQ= values are used.

SAS Institute recommends setting ALQ= to the size of the data set to be written. If you are uncertain of the size, underestimate and use DEQ= for extents. Values of DEQ= larger than 5000 blocks are not recommended. For information about predicting data set size, see Estimating the Size of a SAS Data Set.

The following is an example of using the ALQ= and DEQ= options:

libname x '[]';
/* Know this is a big data set. */
/* Preallocate 100,000 blocks; extend in 5,000-block increments. */
data x.big (alq=100000 deq=5000);
   length a b c d e f g h i j k l m
          n o p q r s t u v w x y z $200;
   do ii=1 to 13000;
      output;
   end;
run;
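
As a rough check on the ALQ= value in the example above, the following sketch estimates how many 512-byte disk blocks the observations alone occupy. It is a back-of-the-envelope calculation, not the method described in Estimating the Size of a SAS Data Set: it assumes 26 character variables of length 200 plus one 8-byte numeric variable (II), and it ignores page overhead and unused space at the end of each page, so the real data set is somewhat larger.

```sas
/* Rough size estimate for X.BIG; page overhead is ignored (assumption). */
data _null_;
   obs_len = 26*200 + 8;         /* 26 $200 variables plus numeric II  */
   n_obs   = 13000;              /* observations written by the DO loop */
   bytes   = obs_len * n_obs;
   blocks  = ceil(bytes / 512);  /* one OpenVMS disk block is 512 bytes */
   put obs_len= bytes= blocks=;
run;
```

Under these assumptions the estimate comes to a little over 132,000 blocks, so ALQ=100000 deliberately underestimates the size and the remainder is picked up by a handful of 5000-block extensions.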

Note: If you do not want to specify an exact number of blocks for the data set, use the ALQMULT= and DEQMULT= options.
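
The note above can be sketched as follows. The multiplier values 200 and 100 are illustrative assumptions, not recommendations; they simply replace the default of 10 pages for the initial allocation and 5 pages for each extension with larger page counts.

```sas
libname x '[]';
/* Allocate by pages rather than by blocks (values are illustrative). */
data x.big (alqmult=200 deqmult=100);
   length a b c d e f g h i j k l m
          n o p q r s t u v w x y z $200;
   do ii=1 to 13000;
      output;
   end;
run;
```

Because the allocation is expressed in pages, it scales with the page size that is in effect for the data set, which can be convenient when the same program writes data sets with different observation lengths.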

References for Allocating Data Set Space

Data set option: ALQ=
Data set option: DEQ=
Data set option: ALQMULT=
Data set option: DEQMULT=
Guide to OpenVMS File Applications

Turning Off Disk Volume Highwater Marking

Job type: Any SAS application that writes data sets. Data set size is not important.
User: System manager.
Usage: Use the /NOHIGHWATER_MARKING qualifier when initializing disks. For active disks, issue the DCL command SET VOLUME/NOHIGHWATER_MARKING.
Benefit: There is a greater percentage gain for jobs that are write intensive. The savings in elapsed time can be as great as 40 percent. Direct I/Os are reduced.
Cost: There is no performance penalty. However, for security purposes, some OpenVMS sites may require this OpenVMS highwater marking feature to be set.

Highwater marking is an OpenVMS security feature that is enabled by default. It forces prezeroing of disk blocks for files that are opened for random access. All SAS data sets are random-access files and, therefore, pay the performance penalty of prezeroing, increased I/Os, and increased elapsed time.

Either of two DCL commands can be used to disable highwater marking on a disk. When initializing a new volume, use the /NOHIGHWATER_MARKING qualifier to disable the highwater function, as in the following example:

$ initialize/nohighwater $DKA470 mydisk

To disable volume highwater marking on an active disk, use a command similar to the following:

$ set volume/nohighwater $DKA200

References for Turning Off Disk Volume Highwater Marking

OpenVMS System Manager's Manual: Tuning, Monitoring, and Complex Systems
OpenVMS DCL Dictionary A-M

Eliminating Disk Fragmentation

Job type: Any jobs that frequently access common data sets.
User: SAS programmer and system manager.
Usage: Devote a disk to frequently accessed data sets, or keep your disks defragmented.
Benefit: The savings in elapsed time vary with the current state of the disk, but they can exceed 50 percent on write operations and 25 percent on read operations.
Cost: The cost to the user is the time and effort to better manage disk access. For the system manager, it can involve regularly defragmenting disks or obtaining additional disk drives.

Any software that reads from and writes to disk benefits from a well-managed disk, and SAS data sets are no exception. On an unfragmented disk, files are kept contiguous; thus, after one I/O operation, the disk head is well positioned for the next I/O operation.

A disk drive that is frequently defragmented can provide performance benefits. Use a frequently defragmented disk to store commonly accessed SAS data sets. In some situations, adding an inexpensive SCSI drive to the configuration allows the system manager to maintain a clean, unfragmented environment more easily than using a large disk farm. Data sets maintained on an unfragmented SCSI disk may perform better than heavily fragmented data sets on larger disks.

By defragmenting, we mean a process that runs the OpenVMS Backup Facility after regular business hours. SAS Institute does not recommend using dynamic defragmenting tools that run in the background of an active system because such programs can corrupt files.

Setting Larger Buffer Size for Sequential Write and Read Operations

Job type: SAS steps that do sequential I/O operations on large data sets.
User: SAS programmer.
Usage: The CACHESIZ= data set option controls the buffering of data set pages during I/O operations. CACHESIZ= can be used either as a data set option or in a LIBNAME statement that uses the BASE engine. The BUFSIZE= data set option sets the data set page size when the data set is created. BUFSIZE= can be used as a data set option, in a LIBNAME statement, or as a SAS system option.
Benefit: There is as much as a 30-percent decrease in elapsed time in some steps when an appropriate value is chosen for a particular data set.
Cost: If the data set observation size is large, substantial space in the data set may be wasted if you do not choose an appropriate value for BUFSIZE=. Also, memory is consumed for the data cache, and multiple caches may be used for each data set opened.

Using the BUFSIZE= Option

The BUFSIZE= data set option sets the SAS internal page size for the data set. Once set, this becomes a permanent attribute of the file that cannot be changed. This option is meaningful only when you are creating a data set. If you do not specify a BUFSIZE= option, the SAS System selects a value that contains as many observations as possible with the least amount of wasted space.

An observation cannot span page boundaries. Therefore, unused space at the end of a page may occur unless the observations pack evenly into the page. By default, the SAS System tries to choose a page size between 8192 and 32768 if an explicit BUFSIZE= option has not been specified. If you increase the BUFSIZE= value, more observations can be stored on a page, and the same amount of data can be accessed with fewer I/Os. When you choose a BUFSIZE= value explicitly, pick one that leaves little unused space at the end of each data set page, because that unused space is wasted disk space. The highest recommended value for BUFSIZE= is 65024.

The following is an example of an efficiently written large data set, using the BUFSIZE= data set option. Note that in the following example, BUFSIZE=63488 becomes a permanent attribute of the data set:

libname buf '[]';
data buf.big (bufsize=63488);
   length a b c d e f g h i j k l m 
          n o p q r s t u v w x y z $200;
   do ii=1 to 13000;
      output;
   end;
run;
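
To see why a particular BUFSIZE= value packs well, the following sketch compares a few candidate page sizes for the observation length used in the example above. It is an approximation under stated assumptions: the observation length is taken as 26*200 + 8 bytes, and any per-page overhead that SAS reserves is ignored, so the actual number of observations per page may be slightly lower.

```sas
/* Compare unused space per page for several candidate page sizes.  */
/* Per-page overhead is ignored (assumption), so treat as a guide.  */
data _null_;
   obs_len = 26*200 + 8;                      /* bytes per observation */
   do bufsize = 8192, 32768, 63488, 65024;
      obs_per_page = floor(bufsize / obs_len);
      unused       = bufsize - obs_per_page * obs_len;
      pct_unused   = round(100 * unused / bufsize, 0.1);
      put bufsize= obs_per_page= unused= pct_unused=;
   end;
run;
```

Under these assumptions, 63488 leaves the least unused space of the four candidates, and 8192 holds only one observation per page, leaving more than a third of each page unused.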

Using the CACHENUM= Option

For each SAS file that you open, the SAS System maintains a set of caches to buffer the data set pages. The size of each of these caches is controlled by the CACHESIZ= option. The number of caches used for each open file is controlled by the CACHENUM= option. The ability to maintain more data pages in memory potentially reduces the number of I/Os that are required to access the data. The number of caches that are used to access a file is a temporary attribute. It may be changed each time you access the file.

By default, up to 10 caches are used for each SAS file that is opened, and each cache is CACHESIZ= bytes in size. On a memory-constrained system, you may wish to reduce the number of caches in order to conserve memory.

The following example shows using the CACHENUM= option to specify that 8 caches of 65024 bytes each are used to buffer data pages in memory.

proc sort data=cache.big (cachesiz=65024 cachenum=8);
   by serial;
run;
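
When deciding on a CACHENUM= value for a memory-constrained system, it can help to work out the upper bound on cache memory per open file, which is simply the number of caches multiplied by the size of each cache. The sketch below does that arithmetic for the 8-cache example above and for the default of up to 10 caches; it assumes every cache is fully allocated, which may overstate actual usage.

```sas
/* Upper bound on cache memory per open SAS file, in bytes. */
data _null_;
   cachesiz = 65024;
   do cachenum = 8, 10;
      cache_bytes = cachenum * cachesiz;
      put cachenum= cache_bytes=;
   end;
run;
```

With 65024-byte caches, 8 caches come to 520,192 bytes and 10 caches to 650,240 bytes for each open file.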

Using the CACHESIZ= Option

The SAS System maintains a cache that is used to buffer multiple data set pages in memory. This reduces the number of I/O operations by enabling SAS to read or write multiple pages in a single operation. SAS maintains multiple caches for each data set that is opened. The CACHESIZ= data set option specifies the size of each cache.

The CACHESIZ= value is a temporary attribute that applies only to the data set that is currently open. You can use a different CACHESIZ= value at different times when accessing the same file. To conserve memory, a maximum of 65024 bytes is allocated for the cache by default. The default allows as many pages as can be completely contained in the 65024-byte cache to be buffered and accessed with a single I/O. You can specify a different CACHESIZ= value of up to 65024 bytes, which is the largest amount that can be accessed in a single I/O in an OpenVMS operating environment.

Here is an example that uses the CACHESIZ= data set option to write a large data set efficiently. Note that in the following example, CACHESIZ= value is not a permanent attribute of the data set:

libname cache '[]';
data cache.big (cachesiz=65024);
   length a b c d e f g h i j k l m 
          n o p q r s t u v w x y z $200;
   do ii=1 to 13000;
      output;
   end;
run;
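
Because CACHESIZ= is not stored with the data set, a later step can read the same file with a different cache size. The following is a minimal sketch of that; the value 32768 is an arbitrary illustration, not a recommended setting, and the step assumes that CACHE.BIG was created by the example above.

```sas
libname cache '[]';
/* Read the same data set with a smaller cache (value is illustrative). */
proc means data=cache.big (cachesiz=32768) n;
   var ii;
run;
```

The smaller cache applies only to this step; the next step that opens CACHE.BIG without the option uses the default again.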

Using Asynchronous I/O When Processing SAS Data Sets

Job type: Jobs that read or write SAS files.
User: SAS programmer.
Usage: The BASE engine now performs asynchronous reading and writing by default. This allows overlap between SAS data set I/O and computation time. Note: Asynchronous reading and writing is enabled only if caching is turned on.
Benefit: Asynchronous I/O allows other processing to continue while waiting on I/O completion. If there is a large gap between the CPU time used and the elapsed time reported in the FULLSTIMER statistics, asynchronous I/O can help reduce that gap.
Cost: Because data page caching must be in effect, you incur the memory cost of the I/O cache. For more information about controlling the size and number of caches used for a particular SAS file, see the data set options CACHENUM= and CACHESIZ=.

In Version 8, asynchronous I/O is enabled by default. There are no additional options that need to be specified to use this feature. For all SAS files that use a data cache, SAS performs asynchronous I/O. Since multiple caches are now available for each SAS file, while an I/O is being performed on one cache of data, the SAS System may continue processing using other caches. For example, when SAS writes to a file, once the first cache becomes full an asynchronous I/O is initiated on that cache, but the SAS System does not have to wait on the I/O to complete. While that transaction is in progress, the SAS System can continue processing new data pages and store them in one of the other available caches. When that cache is full, an asynchronous I/O may be initiated on that cache as well.

Similarly, when SAS reads a file, additional caches of data may be read from the file asynchronously in anticipation of those pages being requested by the SAS System. When those pages are required, they will have already been read from disk, and no I/O wait need occur.

Because caching (with multiple caches) needs to be enabled in order for asynchronous I/O to be effective, if the cache is disabled with the CACHESIZ=0 option or the CACHENUM=0 option, no asynchronous I/O can occur.
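
One way to observe the effect is to turn on the FULLSTIMER system option and read the same data set twice, once with the default caching and once with the cache disabled. This is a sketch for experimentation only: it assumes the CACHE.BIG data set from the earlier example exists, and the timings you see depend on your disk and system load.

```sas
options fullstimer;
libname cache '[]';

/* Default caching: asynchronous I/O can overlap with processing. */
data _null_;
   set cache.big;
run;

/* Cache disabled: no asynchronous I/O can occur for this step. */
data _null_;
   set cache.big (cachesiz=0);
run;
```

Compare the gap between CPU time and elapsed time that the SAS log reports for the two steps; a noticeably larger gap in the second step suggests that the job benefits from asynchronous I/O.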

References for Using Asynchronous I/O

Data set option: CACHENUM=
Data set option: CACHESIZ=
