Using Spatial Data with SAS/GIS Software

How Batch Geocoding Works

To achieve the most accurate geocoding, ensure that the address data set to be geocoded contains name, address, city, state, ZIP code, and ZIP+4 variables. At least the address and city variables are required.

The geocoding facility first creates sorted and summarized versions of the chains, nodes, and details data sets and stores them in the SAS data library that is specified by the GLIB macro variable. Names for the geocoding data sets are generated from the specified chains data set name. For example, if your chains data set is GMAPS.USAC and you specify GLIB=GEOLIB in the %GCBATCH macro, then the geocoding facility creates the following data sets:
GEOLIB.USAS (Sorted chains)
GEOLIB.USAM (Summarized chains)
GEOLIB.USAP (Detail points and nodes)
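
The naming pattern in this example can be illustrated with a short Python sketch. It is not SAS code and not part of the geocoding facility. The function name and the rule itself (replace the final character of the chains member name with S, M, or P) are assumptions inferred from the single GMAPS.USAC example above.

```python
def geocoding_dataset_names(chains: str, glib: str) -> dict[str, str]:
    """Derive the geocoding data set names from a two-level chains name.

    Assumption: the last character of the member name is replaced by
    S, M, or P, as the GMAPS.USAC example suggests.
    """
    # Split "GMAPS.USAC" into the libref and the member name.
    _, member = chains.upper().rsplit(".", 1)
    stem = member[:-1]  # "USAC" -> "USA"
    suffixes = {
        "sorted_chains": "S",
        "summarized_chains": "M",
        "detail_points": "P",
    }
    return {role: f"{glib.upper()}.{stem}{letter}"
            for role, letter in suffixes.items()}


names = geocoding_dataset_names("GMAPS.USAC", "GEOLIB")
for role, name in names.items():
    print(role, name)
```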

The geocoding facility uses these data sets to match the addresses in the address data set. As it is processing the address data set, the geocoding facility provides a progress indicator. For every 10 percent of the addresses that are geocoded, a message is written to the SAS log.

When a match is found, the coordinates of the address location are added to the address data set, along with any other composite values for the specified address. For example, if the spatial data have a composite named TRACT that contains census tract numbers, you can use the geocoding process to add a TRACT variable to your address data set. The resulting geocoded address data set can be used as attribute data for the map, or it can be imported to add point data to the map by using a generic import.

If an address cannot be matched to the spatial data but the address includes a ZIP code, the X and Y coordinates of the centroid of that ZIP code zone are returned instead of the exact coordinates of the address.
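
The fallback order described above can be sketched in Python as follows. This is an illustration only, not the facility's implementation. The function name, the dictionary lookups, and the coordinate values are all invented for the example.

```python
from typing import Optional

Coordinates = tuple[float, float]


def locate(address_key: str,
           zip_code: Optional[str],
           street_matches: dict[str, Coordinates],
           zip_centroids: dict[str, Coordinates]) -> Optional[Coordinates]:
    """Return exact coordinates if available, else the ZIP code centroid."""
    # First choice: the address matched the spatial data.
    if address_key in street_matches:
        return street_matches[address_key]
    # Fallback: no street match, but the address carries a known ZIP code.
    if zip_code and zip_code in zip_centroids:
        return zip_centroids[zip_code]
    # Neither a street match nor a usable ZIP code.
    return None


# Invented sample data, for illustration only.
street_matches = {"100 MAIN ST": (-78.10, 35.20)}
zip_centroids = {"00001": (-78.00, 35.00)}

exact = locate("100 MAIN ST", "00001", street_matches, zip_centroids)
fallback = locate("999 ELM ST", "00001", street_matches, zip_centroids)
unmatched = locate("999 ELM ST", None, street_matches, zip_centroids)
```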

For matching purposes, the geocoding process converts the address components to uppercase and attempts to convert direction and street type values to standard forms. The standardized versions of the address components are also added to the address data set. The M_ADDR, M_CITY, M_STATE, M_ZIP, and M_ZIP4 variables that are added to the address data set reflect the address values that were actually matched during the geocoding process.
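
As a rough illustration of this kind of standardization, the following Python sketch converts an address to uppercase and replaces direction and street type words with abbreviated forms. The lookup tables and the choice of abbreviations are assumptions made for the example. The actual standard forms that the geocoding process uses are not listed here.

```python
# Hypothetical lookup tables; the real standard forms may differ.
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
STREET_TYPES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}


def standardize(address: str) -> str:
    """Uppercase an address and normalize direction and street type words."""
    # Uppercase, drop periods ("W." -> "W"), and split on whitespace.
    tokens = address.upper().replace(".", "").split()
    # Replace each token that has a standard form; keep the rest unchanged.
    return " ".join(DIRECTIONS.get(t, STREET_TYPES.get(t, t)) for t in tokens)


print(standardize("123 north Main Street"))
print(standardize("500 W. Oak Ave."))
```

A token-by-token replacement like this is deliberately naive. For instance, it would also abbreviate a street that is literally named "North", which a real matcher would need to handle.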

The geocoding process also adds _SCORE_ and _STATUS_ variables to the address data set. The _SCORE_ variable's value indicates the reliability of the address match. The score is calculated by adding points for matching various components of the address, as follows:

Matching Characteristic    Points
Street number              40
Street name                20
Street type                5
Street direction           5
City                       5
State                      5
ZIP code                   15 (or 5 if only the first three digits match)
ZIP+4 code                 5

A score of 100 indicates that a match was found for all of the components of the address. A score of 100 is possible only if the address in the data set includes values for all components and the spatial database has composites for all components. For example, if the address in the data set does not have a ZIP+4 value or if the spatial database does not have a composite of class PLUS4, then the highest possible score is 95.
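
The scoring arithmetic can be checked with a small Python sketch. The point values come from the table above. The function name, the component keys, and the `zip_match` argument are hypothetical names chosen for the illustration, and the code is not how the facility computes _SCORE_ internally.

```python
# Point values from the scoring table; ZIP code is handled separately
# because it can earn either 15 points (full match) or 5 (first three digits).
POINTS = {
    "street_number": 40,
    "street_name": 20,
    "street_type": 5,
    "street_direction": 5,
    "city": 5,
    "state": 5,
    "zip4": 5,
}
ZIP_POINTS = {"none": 0, "prefix": 5, "full": 15}


def match_score(matched: set[str], zip_match: str = "none") -> int:
    """Add up the points for the matched address components."""
    if zip_match not in ZIP_POINTS:
        raise ValueError(f"unknown zip_match value: {zip_match!r}")
    return sum(POINTS[name] for name in matched) + ZIP_POINTS[zip_match]


all_components = set(POINTS)
print(match_score(all_components, "full"))             # 100
print(match_score(all_components - {"zip4"}, "full"))  # 95
```

The second call reproduces the case described above: without a ZIP+4 match, the highest possible score is 95.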

The _STATUS_ variable can contain values such as the following:



Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.