RXPARSE

Parses a pattern and returns a value

Category: Character String Matching

Syntax
Syntax Description
Arguments
Character Classes
	Default Character Classes
	User-defined Character Classes
	Character Class Complements
	Reusing Character Classes
Pattern Abbreviations
Matching Balanced Symbols
Special Symbols
Scores
Tag Expression
Change Expressions
Change Items
Details
	General Information
	Using Quotation Marks in Expressions
Comparisons
Example
See Also

Syntax

rx=RXPARSE(pattern-expression)

Syntax Description

Arguments

rx

specifies a numeric value that is passed to other regular expression (RX) functions and call routines.

pattern-expression

specifies a character constant, variable, or expression whose value is a literal or a pattern expression. A pattern-expression is composed of the following elements:

string-in-quotation-marks

matches a substring consisting of the characters in the string.

letter

matches the upper- or lowercase letter in a substring.

digit

matches the digit in a substring.

period (.)

matches a period (.) in a substring.

underscore (_)

matches an underscore (_) in a substring.

?

matches any one character in a substring.

colon (:)

matches any sequence of zero or more characters in a substring.

$'pattern' or $"pattern"

matches any one character in a substring that is one of the characters listed within the quotation marks.

Tip:

Ranges of alphanumeric characters are indicated by a hyphen (-).

Example:

To match any lowercase letter, use

rx=rxparse("$'a-z'");

See:

User-defined Character Classes

~'character-class' or ^'character-class' or ~"character-class" or ^"character-class"

matches any one character that is not matched by the corresponding character class.

Tip:

Ranges of alphanumeric characters are indicated by a hyphen (-).

Example:

To exclude the letters a-d from the match, use

rx=rxparse("^'a-d'");

See:

Character Class Complements

pattern1 pattern2 or pattern1 || pattern2

selects any substring matched by pattern1 followed immediately by any substring matched by pattern2 (with no intervening blanks).

pattern1 | pattern2

selects any substring matched by pattern1 or any substring matched by pattern2.

Tip:	You can use an exclamation point (!) instead of a vertical bar (|).

(pattern)

matches a substring that contains a pattern. You can use parentheses to indicate the order in which operations are performed.

[pattern] or {pattern}

matches a substring that contains a pattern or null string.

pattern*

matches zero or more consecutive strings matched by a pattern.

pattern+

matches one or more consecutive strings matched by a pattern.

@int

matches the position of a variable if the next character is located in the column specified by int. @0 matches end-of-line. If int is negative, it matches -int positions from end-of-line.

reuse-character-class

reuses a character-class you previously defined.

See:	Reusing Character Classes

pattern-abbreviation

specifies ways to shorten pattern representation.

See:	Pattern Abbreviations, Default Character Classes

balanced-symbols

specifies the number of nested parentheses, brackets, braces, or less-than/greater-than symbols in a mathematical expression.

See:	Matching Balanced Symbols

special-symbol

specifies a position in a string, or a score value.

See:	Special Symbols

score-value

selects the pattern with the highest score value.

See:

Scores

<pattern>

retrieves a matched substring for use in a change expression.

See:	Tag Expression

change-expression

specifies a pattern change operation that replaces a string containing a matched substring by concatenating values to the replacement string.

See:	Change Expressions

change-item

specifies items used for string manipulation.

See:	Change Items
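
The elements above can be combined in a single pattern. The following is a minimal sketch, not taken from the original documentation, that uses the alternation element (pattern1 | pattern2) with two quoted strings. The variable names text and pos and the sample strings are hypothetical.

```sas
data _null_;
   length text $ 40;
   /* Alternation: the pattern matches either the string cat or the string dog. */
   rx = rxparse("'cat' | 'dog'");
   do text = 'my dog barks', 'the cat sleeps', 'no pets here';
      /* RXMATCH returns the position where the match begins, or 0 if none is found. */
      pos = rxmatch(rx, text);
      put text= pos=;
   end;
   /* Release the storage allocated by RXPARSE. */
   call rxfree(rx);
run;
```

The pattern is parsed once and reused for every string, which avoids reparsing the same expression inside the loop.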

Character Classes

Using a character class element is a shorthand method for specifying a range of values for matching. In pattern matching, you can

use default character classes
define your own character classes
use character class complements
reuse character classes.

You specify a default character class with a dollar sign ($) followed by a single upper- or lowercase letter. In the following list, the character class is listed in the left column and the definition is listed in the right column.

$a or $A matches any alphabetic upper- or lowercase letter in a substring ($'a-zA-Z').

$c or $C matches any character allowed in a version 6 SAS name that is found in a substring ($'0-9a-zA-Z_').

$d or $D matches any digit in a substring ($'0-9').

$i or $I matches any initial character in a version 6 SAS name that is found in a substring ($'a-zA-Z_').

$l or $L matches any lowercase letter in a substring ($'a-z').

$u or $U matches any uppercase letter in a substring ($'A-Z').

$w or $W matches any white space character, such as blank, tab, backspace, carriage return, etc., in a substring.

See also: Character Class Complements

Note: A hyphen appearing at the beginning or end of a character class is treated as a member of the class rather than as a range symbol.

This statement and these values produce these matches.

rx=rxparse("$character-class");

Pattern	Input string	Position of match	Value of match
$L or $l	3+Y STRIkeS	9	k
$U or $u	0*5x49XY	7	X (uppercase)

The following example shows how to use a default character class in a DATA step.

data _null_;
   stringA='3+Y STRIkeS';
   rx=rxparse("$L");
   matchA = rxmatch(rx,stringA);
   valueA=substr(stringA,matchA,1);
   put 'Example A: ' matchA = valueA= ;
run;

data _null_;
   stringA2='0*5x49XY';
   rx=rxparse("$u");
   matchA2 = rxmatch(rx,stringA2);
   valueA2 = substr(stringA2, matchA2,1);
   put 'Example A2: ' matchA2 = valueA2= ;
run;

The SAS log shows the following results:

Example A: matchA=9 valueA=k
Example A2: matchA2=7 valueA2=X

User-defined Character Classes

A user-defined character class begins with a dollar sign ($) and is followed by a string in quotation marks. A character class matches any one character within the quotation marks.

Note: Ranges of values are indicated by a hyphen (-).

This statement and these values produce these matches.

rx=rxparse("$'pattern'");

Pattern	Input string	Position of match	Value of match
$'abcde'	3+yE strikes	11	e
$'1-9'	z0*549xy	4	5

The following example shows how to use a user-defined character class in a DATA step.

data _null_;
   stringB='3+yE strikes';
   rx=rxparse("$'abcde'");
   matchB = rxmatch(rx,stringB);
   valueB=substr(stringB,matchB,1);
   put 'Example B: ' matchB= valueB= ;
run;

data _null_;
   stringB2='z0*549xy';
   rx=rxparse("$'1-9'");
   matchB2=rxmatch(rx,stringB2);
   valueB2=substr(stringB2,matchB2,1);
   put 'Example B2: ' matchB2= valueB2= ;
run;

The SAS log shows the following results:

Example B: matchB=11 valueB=e
Example B2: matchB2=4 valueB2=5

You can also define your own character class complements. For details about character class complements, see Character Class Complements.

Character Class Complements

A character class complement begins with a caret (^) or a tilde (~) and is followed by a string in quotation marks. A character class complement matches any one character that is not matched by the corresponding character class. For details about character classes, see Character Classes.

This statement and these values produce these matches.

rx=rxparse(^character-class | ~character-class);

Pattern	Input string	Position of match	Value of match
^u or ~u	0*5x49XY	1	0
^'A-z' or ~'A-z'	Abc de45	4	the first space

The following example shows how to use a character class complement in a DATA step.

data _null_;
   stringC='0*5x49XY';
   rx=rxparse('^u');
   matchC = rxmatch(rx,stringC);
   valueC=substr(stringC,matchC,1);
   put 'Example C: ' matchC = valueC=;
run;

data _null_;
   stringC2='Abc de45';
   rx=rxparse("~'A-z'");
   matchC2=rxmatch(rx,stringC2);
   valueC2=substr(stringC2,matchC2,1);
   put 'Example C2: ' matchC2= valueC2= ;
run;

The SAS log shows the following results:

Example C: matchC=1 valueC=0
Example C2: matchC2=4 valueC2=

Reusing Character Classes

You can reuse character classes you previously defined by using one of the following patterns:

$int

reuses the int-th character class.

Restriction:

int is a nonzero integer.

Example:

If you defined a character class in a pattern and want to use the same character class again in the same pattern, use $int to refer to the int-th character class you defined. If int is negative, count backward from the last character class defined to identify the character class for -int. For example,

rx=rxparse("$'AB' $1 $'XYZ' $2 $-2");

is equivalent to

rx=rxparse("$'AB' $'AB' $'XYZ' $'XYZ' $'AB'");

The $1 element in the first code sample is replaced by AB in the second code sample, because AB was the first pattern defined.
The $2 element in the first code sample is replaced by XYZ in the second code sample, because XYZ was the second pattern defined.
The $-2 element in the first code sample is replaced by AB in the second code sample, because AB is the second-to-the-last pattern defined.

~int or ^int

reuses the complement of the int-th character class.

Restriction:

int is a nonzero integer.

Example:

This example shows character-class elements ($'Al', $'Jo', $'Li') and reuse numbers ($1, $2, $3, ~2):

rx=rxparse("$'Al' $1 $'Jo' $2 $'Li' $3 ~2");

is equivalent to

rx=rxparse("$'Al' $'Al' $'Jo' $'Jo' 
           $'Li' $'Li' $'Al' $'Li'");

The ~2 matches patterns 1 (Al) and 3 (Li), and excludes pattern 2 (Jo).
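
As a minimal sketch of reuse in a DATA step (an illustration that is not part of the original documentation), the following pattern defines one character class and reuses it with $1. The variable names text and pos and the sample string are hypothetical.

```sas
data _null_;
   length text $ 20;
   /* $'aeiou' defines the first character class; $1 reuses it, so the
      pattern asks for two consecutive lowercase vowels. */
   rx = rxparse("$'aeiou' $1");
   text = 'a quiet room';
   /* RXMATCH returns the starting position of the match, or 0 if there is none. */
   pos = rxmatch(rx, text);
   put text= pos=;
   call rxfree(rx);
run;
```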

Pattern Abbreviations

You can use the following list of elements in your pattern:

$f or $F matches a floating point number.

$n or $N matches a SAS name.

$p or $P indicates a prefix option.

$q or $Q matches a string in quotation marks.

$s or $S indicates a suffix option.

This statement and input string produce these matches.

rx=rxparse($pattern-abbreviation pattern);

Pattern	Input string	Position of match	Value of match
$p wood	woodchucks eat firewood	1	characters "wood" in woodchucks
wood $s	woodchucks eat firewood	20	characters "wood" in firewood

The following example shows how to use a pattern abbreviation in a DATA step.

data _null_;
  stringD='woodchucks eat firewood';
  rx=rxparse("$p 'wood'");
  PositionOfMatchD=rxmatch(rx,stringD);
  call rxsubstr(rx,stringD,positionD,lengthD);
  valueD=substr(stringD,PositionOfMatchD);
  put 'Example D: ' lengthD= valueD= ;
run;

data _null_;
  stringD2='woodchucks eat firewood';
  rx=rxparse("'wood' $s");
  PositionOfMatchD2=rxmatch(rx,stringD2);
  call rxsubstr(rx,stringD2,positionD2,lengthD2);
  valueD2=substr(stringD2,PositionOfMatchD2);
  put 'Example D2: ' lengthD2= valueD2= ;
run;

The SAS log shows the following results:

Example D: lengthD=4 valueD=woodchucks eat firewood
Example D2: lengthD2=4 valueD2=wood

Matching Balanced Symbols

You can match mathematical expressions containing multiple sets of balanced parentheses, brackets, braces, and less-than/greater-than symbols. Both the symbols and the expressions within the symbols are part of the match:

$(int) or $[int] or ${int} or $<int>

indicates the int level of nesting you specify.

Restriction:

int is a positive integer.

Tip:

Using smaller values increases the efficiency of finding a match.

Example:

This statement and input string produce this match.

rx=rxparse("$(2)");

Input string	Position of match	Value of match
(((a+b)*5)/43)	2	((a+b)*5)

The following example shows how to use mathematical symbol matching in a DATA step.

data _null_;
   stringE='(((a+b)*5)/43)';
   rx=rxparse("$(2)");
   call rxsubstr(rx,stringE,positionE,lengthE);
   PositionOfMatchE=rxmatch(rx,stringE);
   valueE=substr(stringE,PositionOfMatchE);
   put 'Example E: ' lengthE= valueE= ;
run;

The SAS log shows the following results:

Example E: lengthE=9 valueE=((a+b)*5)/43)

Special Symbols

You can use the following list of special symbols in your pattern:

\ sets the beginning of a match to the current position.

/ sets the end of a match to the current position.

Restriction: If you use a backward slash (\) in one alternative of a union (|), you must use a forward slash (/) in all alternatives of the union, or in a position preceding or following the union.

$# requests the match with the highest score, regardless of the starting position.

Tip: The position of this symbol within the pattern is not significant.

$- scans a string from right to left.

Tip: The position of this symbol within the pattern is not significant.
Tip: Do not confuse a hyphen (-) used to scan a string with a hyphen used in arithmetic operations.

$@ requires the match to begin where the scan of the text begins.

Tip: The position of this symbol within the pattern is not significant.

The following table shows how a pattern matches an input string.

Pattern	Input string	Value of match
c\ow	How now brown cow?	characters "ow" in cow
ow/n	How now brown cow?	characters "ow" in brown
@3:\ow	How now brown cow?	characters "ow" in now

The following example shows how to use special symbol matching in a DATA step.

data _null_;
   stringF='How now brown cow?';
   rx=rxparse("c\ow");
   matchF=rxmatch(rx,stringF);
   valueF=substr(stringF,matchF,2);
   put 'Example F= ' matchF= valueF= ;
run;

data _null_;
   stringF2='How now brown cow?';
   rx=rxparse("@3:\ow");
   matchF2=rxmatch(rx,stringF2);
   valueF2=substr(stringF2,matchF2,2);
   put 'Example F2= ' matchF2= valueF2= ;
run;

The SAS log shows the following results:

Example F= matchF=16 valueF=ow
Example F2= matchF2=6 valueF2=ow

Scores

When a pattern is matched by more than one substring beginning at a specific position, the longest substring is selected. To change the selection criterion, assign a score value to each substring by using the pound sign (#) special symbol followed by an integer.

The score for any substring begins at zero. When #int is encountered in the pattern, the value of int is added to the score. If two or more matching substrings begin at the same leftmost position, SAS selects the substring with the highest score value. If two substrings begin at the same leftmost position and have the same score value, SAS selects the longer substring. The following is a list of score representations:

#int adds int to the score, where int is a positive or negative integer.

#*int multiplies the score by nonnegative int.

#/int divides the score by positive int.

#=int assigns the value of int to the score.

#>int finds a match if the current score exceeds int.

Tag Expression

You can assign a substring of the string being searched to a character variable with the expression name=<pattern>, where pattern specifies any pattern expression. The substring matched by this expression is assigned to the variable name.

If you enclose a pattern in less-than/greater-than symbols (<>) and do not specify a variable name, SAS automatically assigns the pattern to a variable. SAS assigns the variable _1 to the first occurrence of the pattern, _2 to the second occurrence, etc. This assignment is called tagging. SAS tags the corresponding substring of the matched string.

The following shows the syntax of a tag expression:

<pattern>: specifies a pattern expression. SAS assigns a variable to each occurrence of pattern for use in a change expression.

Change Expressions

If you find a substring that matches a pattern, you can change the substring to another value. You must specify the pattern expression, use the TO keyword, and specify the change expression in the argument for RXPARSE. You can specify a list of pattern change expressions by separating each expression with a comma.

A pattern change operation replaces a matched string by concatenating values to the replacement string. The operation concatenates

all characters to the left of the match
the characters specified in the change expression
all characters to the right of the match.

You can have multiple parallel operations within the RXPARSE argument. In the following example,

rx=rxparse("x TO y, y TO x");

each x in a substring is replaced by y, and each y in a substring is replaced by x.
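
The parallel change above could be applied in a DATA step along the following lines. This is a sketch only: the variable names old and new and the sample value are hypothetical, and the CALL RXCHANGE argument order follows the usage shown in the Example section of this page.

```sas
data _null_;
   length old new $ 20;
   /* Parallel change taken from the text: x becomes y, and y becomes x. */
   rx = rxparse("x TO y, y TO x");
   old = 'xyxxy';
   /* 999 is an upper bound on the number of substitutions, as in the
      Example section; NEW receives the changed copy of OLD. */
   call rxchange(rx, 999, old, new);
   put old= new=;
   call rxfree(rx);
run;
```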

A change expression can include the items in the following list. Each item in the list is followed by the description of the value concatenated to the replacement string at the position of the pointer.

string in quotation marks: concatenates the contents of the string.
name: concatenates the name, possibly in a different case.
number: concatenates the number.
period (.): concatenates the period (.).
underscore (_): concatenates the underscore (_).
=int: concatenates the value of the int-th tagged substring if int is positive, or the (-int)-th-from-the-last tagged substring if int is negative. In a parallel change expression, the tag is counted within the component of the parallel change expression that yielded the match, and not over the entire parallel change expression.
==: concatenates the entire matched substring.
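
Tags and the =int item can work together to rearrange pieces of a match. The following sketch assumes a hypothetical input of two digit runs separated by a hyphen and is not taken from the original documentation. The pattern, the variable names old and new, and the sample value are all illustrative.

```sas
data _null_;
   length old new $ 20;
   /* Tag two runs of digits separated by a hyphen; the change expression
      then writes the second tag (=2), a hyphen, and the first tag (=1). */
   rx = rxparse("<$d+> '-' <$d+> TO =2 '-' =1");
   old = '2024-17';
   /* Request at most one substitution and write the result to NEW. */
   call rxchange(rx, 1, old, new);
   put old= new=;
   call rxfree(rx);
run;
```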

Change Items

You can use the items in the following list to manipulate the replacement string. The items position the pointer without affecting the replacement string.

@int moves the pointer to column int where the next string added to the replacement string will start.

@= moves the pointer one column past the end of the matched substring.

>int moves the pointer to the right to column int. If the pointer is already to the right of column int, the pointer is not moved.

>= moves the pointer to the right, one column past the end of the matched substring.

<int moves the pointer to the left to column int. If the pointer is already to the left of column int, the pointer is not moved.

<= moves the pointer to the left, one column past the end of the matched substring.

+int moves the pointer int columns to the right.

-int moves the pointer int columns to the left.

-L left-aligns the result of the previous item or expression in parentheses.

-R right-aligns the result of the previous item or expression in parentheses.

-C centers the result of the previous item or expression in parentheses.

*int repeats the result of the previous item or expression in parentheses int-1 times, producing a total of int copies.

Details

General Information

When creating a pattern for matching, make the pattern as short as possible for greater efficiency. The time required for matching is roughly proportional to the length of the pattern times the length of the string that is searched.
The algorithm used by the regular expression (RX) functions and CALL routines is a nondeterministic finite automaton.

Using Quotation Marks in Expressions

To specify a literal that begins with a single quotation mark, use two single quotation marks instead of one.
Literals inside a pattern must be enclosed by another layer of quotation marks. For example, " 'O' '' connor" matches an uppercase O, followed by a single quotation mark, followed by the letters "connor" in either upper or lower case.

Comparisons

The regular expression (RX) functions and CALL routines work together to manipulate strings that match patterns. Use the RXPARSE function to parse a pattern you specify. Use the RXMATCH function and the CALL RXCHANGE and CALL RXSUBSTR routines to match or modify your data. Use the CALL RXFREE routine to free allocated space.

Note: Use RXPARSE only with other regular expression (RX) functions and CALL routines.
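
The division of labor described above can be sketched as a parse-once, match-many, free-at-end sequence. The data set names lines and flagged and the variables line, pos, and last are hypothetical; this is one possible arrangement, not the only one.

```sas
data lines;
   input line $char30.;
   datalines;
order 66 shipped
no digits here
;

data flagged;
   set lines end=last;
   retain rx;
   /* Parse the pattern only once, on the first observation. */
   if _n_ = 1 then rx = rxparse("$d");
   /* Position of the first digit in LINE, or 0 if there is none. */
   pos = rxmatch(rx, line);
   /* Free the parsed pattern after the last observation is processed. */
   if last then call rxfree(rx);
   drop rx;
run;
```

Retaining rx keeps the parsed pattern available across observations, so RXPARSE does not run again for each row.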

Example

The following example uses RXPARSE to parse an input string and change the value of the string.

data test; 
   input string $;
   datalines;
abcxyzpq
xyyzxyZx
x2z..X7z
;

data _null_;
  set;
  length to $20;
  if _n_=1 then 
     rx=rxparse("` x < ? > 'z' to ABC =1 '@#%'");
  retain rx;
  drop rx;
  put string=;
  match=rxmatch(rx,string);
     put @3 match=;
  call rxsubstr(rx,string,position);
     put @3 position=;
  call rxsubstr(rx,string,position,length,score);
     put @3 position= Length= Score=;
  call rxchange(rx,999,string,to);
     put @3 to=;
  call rxchange(rx,999,string);
     put @3 'New ' string=;
run;

      cpu time            0.05 seconds

1    data test;
2       input string $;
3       datalines;
NOTE: The data set WORK.TEST has 3 observations and 1 variables.
NOTE: DATA statement used:
      real time           0.34 seconds
      cpu time            0.21 seconds

7    ;
8
9    data _null_;
10     set;
11     length to $20;
12     if _n_=1 then
13        rx=rxparse("` x < ? > 'z' to ABC =1 '@#%'");
14     retain rx;
15     drop rx;
16     put string=;
17     match=rxmatch(rx,string);
18        put @3 match=;
19     call rxsubstr(rx,string,position);
20        put @3 position=;
21     call rxsubstr(rx,string,position,length,score);
22        put @3 position= Length= Score=;
23     call rxchange(rx,999,string,to);
24        put @3 to=;
25     call rxchange(rx,999,string);
26        put @3 'New ' string=;
27   run;
string=abcxyzpq
  match=4
  position=4
  position=4 length=3 score=0
  to=abcabcy@#%pq
  New string=abcabcy@
string=xyyzxyZx
  match=0
  position=0
  position=0 length=0 score=0
  to=xyyzxyZx
  New string=xyyzxyZx
string=x2z..X7z
  match=1
  position=1
  position=1 length=3 score=0
  to=abc2@#%..Abc7@#%
  New string=abc2@#%.
NOTE: DATA statement used:
      real time           0.67 seconds
      cpu time            0.45 seconds

See Also

Functions and CALL routines:

CALL RXCHANGE

CALL RXFREE

RXMATCH

CALL RXSUBSTR

Aho, Hopcroft, and Ullman, Chapter 9 (See References)
