SAS/SPECTRAVIEW Software User's Guide

Loading the Data

The first step in the visualization process is selecting and reading your data into SAS/SPECTRAVIEW. The interface guides you through the process.

When you first invoke SAS/SPECTRAVIEW, [Data] is selected by default, ready for you to load data. Note that you can load data at any time during a SAS/SPECTRAVIEW session by reselecting [Data].

Loading Data

[IMAGE]


Selecting a Libref

To display the session's assigned librefs:

  1. Select [Load data]. The software displays the assigned librefs under the label Libname. To assign an additional libref during a session, use the SAS PROGRAM EDITOR window (available if you invoked SAS/SPECTRAVIEW with a command), then refresh the libref list in SAS/SPECTRAVIEW by reselecting [Data] and [Load data].

  2. Select the libref containing the data set that you want to load. Use the scroll bar if there are more than 10 librefs. Once you select the libref, the software displays the data sets associated with it.

Selecting a Libref

[IMAGE]


Selecting a Data Set

SAS/SPECTRAVIEW works as well with small data sets (such as 20 observations) as it does with large data sets (such as a quarter million observations). The SAS data set that you select must meet three requirements: it must have at least four variables, so that you can specify three axis variables and a response variable; the response variable must be numeric; and each variable that you specify for SAS/SPECTRAVIEW must contain at least two unique values. If you want to use a BY variable, the data set must have a fifth variable as well. To load a data set that has only three variables, see Loading a Data Set with Only Three Variables.
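The requirements above can be checked before loading. The following Python sketch is not SAS/SPECTRAVIEW code; it only illustrates the rules. The function name check_loadable, the list-of-dictionaries layout, and the sample values are assumptions made for this example.

```python
from numbers import Number

def check_loadable(rows, x, y, z, response):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    roles = [x, y, z, response]
    # Each SAS/SPECTRAVIEW variable needs its own data set variable.
    if len(set(roles)) < 4:
        problems.append("each role needs a different variable")
    # Every selected variable needs at least two unique (non-missing) values.
    for name in roles:
        unique = {row[name] for row in rows if row[name] is not None}
        if len(unique) < 2:
            problems.append(f"{name} has fewer than two unique values")
    # The response variable must be numeric.
    if not all(isinstance(row[response], Number)
               for row in rows if row[response] is not None):
        problems.append(f"{response} is not numeric")
    return problems

# Made-up rows that borrow EPA variable names; the values are not real data.
rows = [
    {"LNGITUDE": -80.0, "LATITUDE": 35.0, "LEVEL": 1, "SULFATE": 0.00004},
    {"LNGITUDE": -79.0, "LATITUDE": 36.0, "LEVEL": 2, "SULFATE": 0.00006},
]
ok = check_loadable(rows, "LNGITUDE", "LATITUDE", "LEVEL", "SULFATE")

flat = [dict(row, LEVEL=1) for row in rows]  # LEVEL now has one unique value
bad = check_loadable(flat, "LNGITUDE", "LATITUDE", "LEVEL", "SULFATE")
print(ok, bad)
```

In this sketch, the first call reports no problems and the second reports that LEVEL has fewer than two unique values.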

Select the input data set from the list of names. Use the scroll bar if there are more than 10 data sets. Once you select the input data set, the software lists the data set's variables in columns from which you can select SAS/SPECTRAVIEW variables.

Selecting a Data Set

[IMAGE]


Specifying SAS/SPECTRAVIEW Variables

You must specify a different data set variable for each SAS/SPECTRAVIEW variable. That is, you must select a different variable from each of the X Variable, Y Variable, Z Variable, and Response variable columns. The axis variables can be either numeric or character, but the response variable must be numeric.

To help you select appropriate variables, you can place your cursor on a variable name, and the software will display a short description of it in the text window. For example, for the EPA data set, which contains the variables HOUR, LEVEL, LNGITUDE, LATITUDE, SULFATE, and OZONE, their descriptions provide the following information:

Note that any variable that is appropriate as a Response variable is not a valid choice as an axis variable, and any variable that is appropriate as an axis variable is not a valid choice for a Response variable. Attempting to read a data set with inappropriate variables selected could result in the data set failing to load. For the axis variables, choose the variables that build as complete a volume grid as possible from actual data points, and avoid axis variables that are sparsely valued or that contain continuous data.
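One way to judge how complete a volume grid a set of axis variables would give is to compare the number of distinct x,y,z locations in the data with the size of the full grid. The Python sketch below assumes that the grid is the cross product of the unique values on each axis; the function name grid_fill_fraction and the sample points are made up for illustration.

```python
from itertools import product

def grid_fill_fraction(points):
    """Share of the full x*y*z grid that holds at least one real observation."""
    distinct = set(points)
    nx = len({p[0] for p in distinct})
    ny = len({p[1] for p in distinct})
    nz = len({p[2] for p in distinct})
    return len(distinct) / (nx * ny * nz)

# Discrete axes that repeat the same few values fill the grid completely.
dense = list(product([1, 2], [1, 2], [1, 2]))
dense_fraction = grid_fill_fraction(dense)

# Axes whose values never repeat leave most of the grid empty.
sparse = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]
sparse_fraction = grid_fill_fraction(sparse)
print(dense_fraction, sparse_fraction)
```

A fraction near 1 suggests a well-filled grid; a small fraction suggests sparsely valued or continuous axis variables.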

Specifying SAS/SPECTRAVIEW Variables

[IMAGE]

Once you select the four required variables, the software highlights [Read data], but you still have the option of specifying BY variable processing, duplicate values handling, data categorizing, automatic axis scaling, and data subsetting with a WHERE clause, which are discussed in the following sections.


Grouping Observations with a BY Variable

In addition to the four required variables, you have the option of specifying a fifth variable as a BY variable. The values of a BY variable define groups of observations, such as hour, month, or year. Specifying a BY variable allows you to animate an image so that you can see how response values change according to some grouping, such as over time.

A BY variable can be either character or numeric. BY data usually includes multiple response values for a single data point.

For example, in the EPA data set, the variable HOUR contains hour values, which makes it useful as a BY variable. If you imagine that the four required variables generate a cube of data values, then specifying a BY variable generates a sequence of cubes that you can cycle through to see how response values change (in this case, over time).

If you select LNGITUDE, LATITUDE, and LEVEL as the axis variables, SULFATE as the Response variable, then HOUR as the BY variable, you will create a sequence of volumes of data to be displayed and analyzed.
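The idea of a sequence of volumes can be sketched in Python as splitting the observations into one group per BY value. This is an illustration only: the function name split_by, the sort order of the groups, and the sample values are assumptions, not documented behavior.

```python
from collections import defaultdict

def split_by(rows, by):
    """Group observations into one list (one volume) per BY value."""
    volumes = defaultdict(list)
    for row in rows:
        volumes[row[by]].append(row)
    # Sorting the BY values is an assumption made for this sketch.
    return dict(sorted(volumes.items()))

# Made-up rows that borrow EPA variable names; the values are not real data.
rows = [
    {"HOUR": 1, "LNGITUDE": -80.0, "LATITUDE": 35.0, "LEVEL": 1, "SULFATE": 0.00005},
    {"HOUR": 0, "LNGITUDE": -80.0, "LATITUDE": 35.0, "LEVEL": 1, "SULFATE": 0.00004},
    {"HOUR": 0, "LNGITUDE": -79.0, "LATITUDE": 36.0, "LEVEL": 2, "SULFATE": 0.00006},
]
volumes = split_by(rows, "HOUR")
for hour, group in volumes.items():
    print(hour, len(group))
```

Each group holds the observations for one volume, so the same x,y,z location can appear once per BY value without counting as a duplicate.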

Specifying a BY Variable

[IMAGE]

Note: If you do not specify a BY variable but your data contains BY data (like a time variable), you may receive a message in the text window after loading the data. The message warns that there is more than one response value for an x,y,z coordinate. When this occurs, the software handles the response values according to the setting on the Duplicate Values panel.


Handling Duplicate Values

Duplicate values occur when the data has more than one observation for the same x,y,z coordinate, which could result in more than one response value for a data point. Note that if you also categorize the data or if you have specified a BY variable, the instances of duplicate values may increase.

You determine how the software handles duplicate values by selecting one of the choices under the label Duplicate Values. The default is [Last], which means that the last response value encountered for a data point is used as that location's response value.

Handling Duplicate Values

[IMAGE]

To specify how the software handles duplicate values, select one of the following options:

[Count]
For each unique x,y,z location, the software counts the number of observations and uses that count as the response value. For example, if there are three observations that specify the x,y,z location 1,1,1, the response value is 3, regardless of the actual response values in the data.

When you load data, each response value for the resulting data points represents a count of the observations for that location. If there are no duplicate observations for a particular x,y,z location, the response value is 1, indicating that only one observation was found for that location. Similarly, if the data includes no observations for a particular x,y,z location, the response value would be 0, meaning that the data point is missing. [Count] allows you to find the number of response values that were used to calculate other values, for example, [Mean] or [Sum]. If you load data with [Mean], you may want to know how many values were used to calculate the mean value shown at a particular x,y,z location. You can load again using [Count], then probe the data to reveal the number used for the mean.

[Nmiss]
For each unique x,y,z location, the software counts the number of observations with missing response values. For example, if an x,y,z location has two observations and both have a valid response value, the result is a response value of 0, meaning no observations with a missing response value were found for that location.

With [Nmiss] specified, every data point has a response value indicating how many missing response values were encountered for that location. If a valid data point has five observations and only three had response values, then that data point's response value is 2, meaning two observations were found missing a response value for that location. [Nmiss] only counts valid data points having no response value. It does not count filler points generated by the software. If the data does not contain an observation for an x,y,z location, the software inserts a data point that has a missing response value. This means that if you load a data set, display it as a point cloud, and discover there are several missing values in the volume grid, you can reload the data with [Nmiss] selected and determine which missing values are caused by missing response values as opposed to missing axis values.

[Minimum]
If there are two or more response values for the same x,y,z location, the software uses the minimum value as the response value.

[Maximum]
If there are two or more response values for the same x,y,z location, the software uses the maximum value as the response value.

[Sum]
If there are two or more response values for the same x,y,z location, the software uses the sum as the response value.

[Range]
If the data contains at least two response values for each x,y,z location, the software uses the range as the response value. The range is calculated by subtracting the minimum response value from the maximum response value. If there is only one value for a location, the response value is set to missing.

[Last]
If the data contains two or more response values for the same x,y,z location, the software uses the last response value as the response value. This is the default.

[Mean]
If the data contains two or more response values for the same x,y,z location, the software uses the mean as the response value.
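The eight options can be summarized as a single reduction over the response values found at one x,y,z location. The Python sketch below is an illustration, not the software's implementation. It assumes that missing responses are represented by None, that [Minimum], [Maximum], [Sum], [Mean], and [Range] ignore missing responses, and that [Last] returns the final observation's response even if it is missing. The function name combine is made up.

```python
def combine(responses, method="Last"):
    """Reduce the response values at one x,y,z location (None = missing)."""
    present = [v for v in responses if v is not None]
    if method == "Count":
        return len(responses)                 # number of observations
    if method == "Nmiss":
        return len(responses) - len(present)  # observations missing a response
    if method == "Last":
        return responses[-1] if responses else None
    if not present:
        return None                           # nothing to summarize
    if method == "Minimum":
        return min(present)
    if method == "Maximum":
        return max(present)
    if method == "Sum":
        return sum(present)
    if method == "Mean":
        return sum(present) / len(present)
    if method == "Range":
        # A single value has no range, so the result is missing.
        return max(present) - min(present) if len(present) > 1 else None
    raise ValueError(f"unknown method: {method}")

# Five observations at one location, two of them missing a response.
values = [4.0, None, 2.5, None, 7.0]
methods = ["Count", "Nmiss", "Minimum", "Maximum", "Sum", "Range", "Last", "Mean"]
results = {m: combine(values, m) for m in methods}
print(results)
```

Loading once with [Mean] and again with [Count], as suggested above, corresponds to calling combine twice on the same values with different methods.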


Categorizing Data

Categorizing data is an option that groups numeric data to create distinct ranges (called categories) for each axis. You cannot categorize character variables. The result is a reduced number of data points in the volume grid. By categorizing all three axes, you can set exactly how many data points the software will create. Categorizing is most useful for continuous data and is usually less useful for discrete data.

Continuous data (data with few gaps, whose values vary slightly over a large range, like weight and height) are good candidates for categorizing. For example, to analyze heart rate for a group of people based on their age, activity level, and weight, the weight values, recorded in pounds as values like 139.5 and 143.6, would be considered continuous. That is, it is not likely that any two people (let alone several) would have the same weight but a different age and activity level. Creating weight categories, each covering a range of weights and represented by one value, would make the data clearer and easier to use.

Discrete data (containing natural gaps like patient IDs and years) would probably not be as useful to categorize. But discrete data such as hour could be categorized into groups if the degree of precision can be reduced without losing data integrity.

To categorize data:

  1. Select [Categorize]. The software displays a group of sliders and buttons at the bottom of the interface.

    Categorizing Data

    [IMAGE]

  2. Under the label CATEGORIZE AXIS, specify which axis you want to categorize. By default, all three are turned on for categorizing. Use the on/off buttons to turn categorizing on or off for a particular axis. For example, selecting [X on] turns on categorizing for the X axis, and selecting [Y off] turns off categorizing for the Y axis.

  3. Under NUMBER GROUPS, use the sliders to specify the number of categories you want for each axis. You can specify between two and 100 categories for each, with 10 being the default.

  4. Under GROUP AXIS VALUE, for each categorized axis, specify the axis tick mark value:
    [Lower] Uses the lower bound value in each range.
    [Midpoint] Uses the midpoint value in each range. This is the default setting.
    [Upper] Uses the upper bound value in each range.
    [Bounds] Uses both the upper and lower bound values in each range. The values display as a range, for example, 125.1-225.1 for each major tick mark.
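The steps above can be pictured with a short Python sketch. The source does not say how category boundaries are chosen, so the sketch assumes equal-width ranges between the smallest and largest axis value; the function name categorize and the sample weights (other than 139.5 and 143.6) are made up.

```python
def categorize(values, groups=10, label="Midpoint"):
    """Replace each axis value with the tick value of its category."""
    if not 2 <= groups <= 100:
        raise ValueError("groups must be between 2 and 100")
    lo, hi = min(values), max(values)
    width = (hi - lo) / groups  # equal-width ranges: an assumption
    ticks = []
    for v in values:
        # The largest value belongs to the last category.
        k = min(int((v - lo) / width), groups - 1)
        lower, upper = lo + k * width, lo + (k + 1) * width
        if label == "Lower":
            ticks.append(lower)
        elif label == "Midpoint":
            ticks.append((lower + upper) / 2)
        elif label == "Upper":
            ticks.append(upper)
        elif label == "Bounds":
            ticks.append(f"{lower:g}-{upper:g}")
        else:
            raise ValueError(f"unknown label: {label}")
    return ticks

weights = [100.0, 139.5, 143.6, 200.0]
midpoints = categorize(weights, groups=2)
bounds = categorize(weights, groups=2, label="Bounds")
print(midpoints, bounds)
```

With two groups, the first three weights fall in the 100-150 category and share one tick value, which is how categorizing reduces the number of data points on an axis.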


Effect on Duplicate Values Handling

Categorizing data makes it more likely that the software encounters more than one response value for a given x,y,z coordinate. (Uncategorized data usually contain only one response value for each x,y,z coordinate.) When one or more of the axes are categorized, some of the data points become duplicates within a group, which could result in more than one response value for a single data point.

For example, suppose values for the X variable are integers from 1 to 100. If you categorize the X values into groups of 10 values, 1-10 would be a single category. The data points 1,1,1 and 2,1,1 and 3,1,1 and so forth are viewed by the software as the same data point in the volume grid, because after categorizing they all have the same X category and the same Y and Z values.

The response values for the 10 data points would appear to be 10 different response values for the same data point. The response values for the duplicate locations are handled according to the method specified for duplicate values handling, with the default being to use the last response value found as the category's response value.
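The example can be traced in a few lines of Python. This sketch assumes that the X values 1 to 100 fall into ten groups of ten consecutive integers and that the default [Last] handling is in effect; the function name load_categorized and the response values are made up.

```python
def load_categorized(observations, group_size=10):
    """Map each (x, y, z, response) to its X category; keep the last response."""
    grid = {}
    for x, y, z, response in observations:
        category = (x - 1) // group_size   # 1-10 -> 0, 11-20 -> 1, ...
        grid[(category, y, z)] = response  # a later value overwrites: [Last]
    return grid

observations = [
    (1, 1, 1, 5.0),
    (2, 1, 1, 6.0),
    (3, 1, 1, 7.0),   # same category as the two rows above
    (11, 1, 1, 8.0),  # next category
]
grid = load_categorized(observations)
print(grid)
```

The first three observations collapse into one data point whose response is the last value encountered, 7.0.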


Automatically Scaling Axes

By selecting [Auto scale], you can automatically scale the volume's three axes to the same length. The default is that the length of each axis is determined by the range of axis values. For example, an axis with values from 1 to 100 is roughly ten times as long as an axis with values from 1 to 10.

Note: Once a data set is loaded, [Auto scale] is deselected. To load a subsequent data set with automatic scaling, you must select [Auto scale] again.
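The difference between the default and [Auto scale] can be sketched in Python as follows. The sketch assumes that "range" means the maximum axis value minus the minimum and that lengths are expressed relative to the longest axis; the function name axis_lengths is made up.

```python
def axis_lengths(ranges, auto_scale=False):
    """Relative axis lengths, given {axis: (min, max)} for each axis."""
    if auto_scale:
        # All three axes are drawn at the same length.
        return {axis: 1.0 for axis in ranges}
    longest = max(hi - lo for lo, hi in ranges.values())
    return {axis: (hi - lo) / longest for axis, (lo, hi) in ranges.items()}

ranges = {"X": (1, 100), "Y": (1, 10), "Z": (1, 10)}
default_lengths = axis_lengths(ranges)
scaled_lengths = axis_lengths(ranges, auto_scale=True)
print(default_lengths, scaled_lengths)
```

Under this assumption the 1-to-100 axis comes out 99/9 = 11 times as long as a 1-to-10 axis, which is of the same order as the tenfold difference in the example above.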


Subsetting Data with a WHERE Clause

Optionally, you can specify a subset of data to be loaded into SAS/SPECTRAVIEW by specifying condition(s) that observations must meet. You can subset response values by specifying criteria for the response variable, and you can subset data points by specifying criteria for the axis variables.

Subsetting can change the size and shape of the volume grid. For example, subsetting data can create holes that are replaced with filler points, or subsetting can remove holes in data.
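The effect of subsetting on the grid can be illustrated in Python. The sketch assumes that the volume grid is the cross product of the unique axis values that remain after subsetting and that filler points carry a missing response (None here); the function name build_grid and the sample rows are made up.

```python
from itertools import product

def build_grid(rows):
    """rows are (x, y, z, response); returns {(x, y, z): response or None}."""
    xs = sorted({r[0] for r in rows})
    ys = sorted({r[1] for r in rows})
    zs = sorted({r[2] for r in rows})
    grid = {point: None for point in product(xs, ys, zs)}  # filler points
    for x, y, z, response in rows:
        grid[(x, y, z)] = response
    return grid

rows = [(1, 1, 1, 1.0), (1, 2, 1, 2.0), (2, 1, 1, 3.0), (2, 2, 1, 4.0)]

full = build_grid(rows)                                 # 4 points, no holes
holed = build_grid([r for r in rows if r[3] > 1.0])     # 4 points, 1 filler
shrunk = build_grid([r for r in rows if r[3] > 2.0])    # 2 points, no holes
print(len(full), len(holed), len(shrunk))
```

Dropping one observation leaves a hole that is filled with a filler point, while dropping every observation for one X value shrinks the grid instead.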

Prior to selecting [Read data], you can specify subsetting conditions using a SAS WHERE clause:

  1. Select [Where clause].

  2. In the text window, type a SAS WHERE clause without the keyword WHERE and without an ending semicolon. A condition consists of a variable name, an operator (such as EQ, NE, LT, or >), and a value, for example, sulfate > .00005060.

    Subsetting Data

    [IMAGE]

  3. Press Enter.

For details on specifying conditions, see the appropriate WHERE clause documentation. Note that before you invoke SAS/SPECTRAVIEW, you can create a smaller SAS data set containing only the values that you want to use. For example, you could choose certain ranges of axis values or specific response values.
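As an illustration of what a condition such as sulfate > .00005060 selects, the Python sketch below filters a list of observations with an equivalent test. It is not SAS syntax; the function name where and the sample rows are made up, and treating a missing response as failing the condition is an assumption of this sketch.

```python
def where(rows, condition):
    """Keep only the observations that meet the condition."""
    return [row for row in rows if condition(row)]

# Made-up rows that borrow EPA variable names; the values are not real data.
rows = [
    {"LEVEL": 1, "SULFATE": 0.00004},
    {"LEVEL": 2, "SULFATE": 0.00006},
    {"LEVEL": 3, "SULFATE": None},     # missing response
    {"LEVEL": 4, "SULFATE": 0.00007},
]

subset = where(
    rows,
    lambda row: row["SULFATE"] is not None and row["SULFATE"] > 0.00005060,
)
print([row["LEVEL"] for row in subset])
```

A condition on an axis variable, such as a test on LEVEL, would be written the same way and would remove data points rather than response values.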


Reading the Data Set

To have the software read the data, select [Read data].

The software loads the input data, applying any optional specifications. For example, if a WHERE clause is specified, the software loads only those observations meeting the criteria, and if categorizing is specified, the software changes the number of data points accordingly. Once the data set is loaded, the variable list disappears.

If you have loading problems, see Resolving Data Loading Problems.



Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.