A Simple Scanner in Go

The following Go program is an example of a simple scanner (aka lexer, or tokenizer). A scanner reads a string (or file) and converts the string into a stream of tokens for a language. In this case, we are scanning a subset of Go.

There are a number of ways to write scanners, and the approach here is simple and hand-written. Scanners can in fact be automated, using tools like Lex, or even regular expressions.

A few things to note:

  • A token consists of a string value and a type. In Go, there are no enumerated types, and instead programmers use constants. The use of iota is a Go-specific trick to avoid writing a value for each constant.
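
    For instance, iota gives each constant in a const block the next integer value, starting at 0. This little illustrative declaration (the names are made up; it is not part of the scanner) is equivalent to writing the values 0, 1, 2 by hand:

    const (
        APPLE  = iota // 0
        ORANGE        // 1
        BANANA        // 2
    )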

  • Each token in the language needs its own type. Some tokens, like identifiers or numbers, have many different possible values, while others, like + or ==, have only one value.

  • Go, like most languages, reserves some identifiers to be keywords. For example, if, const, for, and type are all reserved for use in Go as keywords. That means you cannot use them for the names of variables, functions, types, etc.

  • fmt is not a keyword. Instead, fmt is the name of a standard Go package. If you don’t import fmt, then you can create your own variables, functions, etc. named fmt (although that might be bad practice since it could be quite confusing).

  • This program stores keywords in a Go map, i.e. a container of key:value pairs that is often implemented as a hash table. When the tokenizer reads in a string of characters, it must decide whether it is a keyword or just an identifier, and using a map makes checking if a string is a keyword fast and easy.

  • Notice how we define the keywords map:

    var keywords map[string]bool = map[string]bool{
        "package": true, "import": true, "func": true, "type": true,
        "struct": true, "for": true,
    }
    

    keywords has the type map[string]bool, where string is the type of the key and bool is the type of the value. After the = comes a map literal, which starts with the type of the map, map[string]bool, and then a sequence of key:value pairs.

    Notice that there is a , at the end of the list; Go requires this (it makes compiling Go a little easier).

  • Here’s how you check if a Go map has a particular key:

    _, isKeyWord := keywords[str]
    

    When you write keywords[str] in this two-value form, it returns two things: the value associated with str in the map, and a bool that indicates whether or not str is a key in the map. Since we don’t care about the value, we use the blank identifier _ for it.
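
    For example, using the keywords map above, a fragment like the following (it would have to run inside a function such as main) shows both kinds of lookup:

    fmt.Println(keywords["for"])   // true: "for" is a key
    fmt.Println(keywords["apple"]) // false: missing keys give the zero value for bool
    _, ok := keywords["apple"]
    fmt.Println(ok)                // false: the second value reports whether the key exists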

  • A lot of simple helper functions have been defined for recognizing subsets of characters. The functions are small and use previous functions when possible.

  • The switch statement in Go is more general than in the C-like languages. In tokenize, it is just a syntactic alternative to an if-else-if statement; you can use whichever looks best to you.
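
    For example, the two fragments below are sketches of the same logic (they reuse the helper functions defined later, and r stands for some rune being examined); in an expressionless switch, the first case whose condition is true is the one that runs:

    switch {
    case isDigit(r):
        fmt.Println("digit")
    case isLetter(r):
        fmt.Println("letter")
    default:
        fmt.Println("something else")
    }

    if isDigit(r) {
        fmt.Println("digit")
    } else if isLetter(r) {
        fmt.Println("letter")
    } else {
        fmt.Println("something else")
    }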

  • The code for the scanner is not difficult, but you need to take care to ensure that you don’t skip any characters, or read the same character twice. So we need to carefully keep track of the value of i.

  • Another basic issue is that we must always check for the end of the string. This complicates some of the logic, but it is necessary and important.
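
    For example, in the identifier loop from tokenize:

    for i < len(s) && isIdentChar(rune(s[i])) {
        i++
    }

    the i < len(s) test comes first, and Go’s && stops evaluating as soon as the left side is false, so s[i] is never read past the end of the string.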

  • Note that when tokenizing keywords/identifiers, we do a greedy match, i.e. we look for the longest string of characters in a row that match. If, instead, we stopped at the first match, an identifier like form would match the keyword for, which is not what we want.
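
    For example, running a fragment like this inside main would show greedy matching at work (the input string is just for illustration):

    tokens, _ := tokenize("for form")
    fmt.Println(tokens) // [{for 0} {form 1}], i.e. the KEYWORD for followed by the IDENT form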

  • If you write enough scanners, you soon notice the same kinds of code coming up again and again, and so it is possible to write various helper functions to simplify the scanner. For example, regular expressions can be used to match tokens, or a tool like Lex can create a scanner given a suitable input.
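
    As a sketch of the regular-expression approach (this is a separate, self-contained program, not part of the scanner above, and the pattern is just one way to write it), Go’s standard regexp package can pick an identifier off the front of a string:

    package main

    import (
        "fmt"
        "regexp"
    )

    // identRE matches an identifier at the start of a string
    var identRE = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*`)

    func main() {
        fmt.Println(identRE.FindString("form := 1")) // form
        fmt.Println(identRE.FindString("123abc"))    // empty string: no identifier at the start
    }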

  • The main application of a scanner is as the first processing step for a compiler or interpreter. But it has other uses, e.g. you could use it to format or colorize source code. Since you now know the type of all the tokens, it would not be hard to display them in a particular color, or format them in a pleasing way. This is essentially what text editors do when they colorize your source code, although typically regular expressions would be used to do it (instead of a hand-written scanner).

  • The scanner just translates a stream of characters into tokens. The scanner doesn’t know anything about the order of the tokens. That’s the parser’s job. The parser needs to use a grammar to determine whether or not a stream of tokens is valid.

Source Code

package main

import "fmt"

//
// token types
//

// iota is a Go trick for automatically numbering constants
const (
    KEYWORD     = iota // e.g. func, range, package, ...
    IDENT              // e.g. x, result, fmt, ...
    STR_LIT            // e.g. "apple"
    OPEN_BRACE         // {
    CLOSE_BRACE        // }
    OPEN_PAREN         // (
    CLOSE_PAREN        // )
    DOT                // .
    ASSIGN             // =
)

// An incomplete list of pre-defined Go keywords. They are stored in a map so
// the tokenizer can quickly test if a string is a keyword.
var keywords map[string]bool = map[string]bool{
    "package": true, "import": true, "func": true, "type": true,
    "struct": true, "for": true,
}

// A Token consists of a string value and a token type called kind (because
// type is a Go keyword).
type token struct {
    value string
    kind  int
}

//
// helper functions
//

func isWhitespace(r rune) bool {
    return r == ' ' || r == '\n' || r == '\t'
}

func isLower(r rune) bool {
    return 'a' <= r && r <= 'z'
}

func isUpper(r rune) bool {
    return 'A' <= r && r <= 'Z'
}

func isLetter(r rune) bool {
    return isLower(r) || isUpper(r)
}

func isIdentStartChar(r rune) bool {
    return r == '_' || isLetter(r)
}

func isDigit(r rune) bool {
    return '0' <= r && r <= '9'
}

func isIdentChar(r rune) bool {
    return isIdentStartChar(r) || isDigit(r)
}

// tokenize translates a string into a slice of tokens. If no errors are
// encountered, then err is nil. Otherwise, err contains a message about the
// error.
//
// tokenize works by using an index, i, to point to the current character in s
// that is being processed. Care must be taken to increment i at the right
// time to ensure no characters are skipped, or read more than once.
//
// Note that a lot of the code is repetitive, and it is possible to generalize
// and shorten this code with more helper functions.
func tokenize(s string) (result []token, err error) {
    for i := 0; i < len(s); {
        // This form of switch statement is essentially a variation of a
        // standard if-else-if structure.
        switch {
        case isWhitespace(rune(s[i])): // ignore whitespace
            i++
        case s[i] == '=':
            result = append(result, token{"=", ASSIGN})
            i++
        case s[i] == '{':
            result = append(result, token{"{", OPEN_BRACE})
            i++
        case s[i] == '}':
            result = append(result, token{"}", CLOSE_BRACE})
            i++
        case s[i] == '(':
            result = append(result, token{"(", OPEN_PAREN})
            i++
        case s[i] == ')':
            result = append(result, token{")", CLOSE_PAREN})
            i++
        case s[i] == '.':
            result = append(result, token{".", DOT})
            i++
        case s[i] == '"': // string literal
            i++ // skip first "
            start := i
            // first " char not preceded by a \
            for !(i >= len(s) || (s[i] == '"' && s[i-1] != '\\')) {
                i++
            }
            // invariant: i >= len(s) || (s[i] == '"' && s[i-1] != '\\')
            if i >= len(s) {
                return result, fmt.Errorf("unterminated string literal")
            } else {
                str := s[start:i] // first and last " excluded
                result = append(result, token{str, STR_LIT})
                i++
            }
        case isIdentStartChar(rune(s[i])): // keyword or identifier
            // greedy matching: the longest sequence of identifier characters
            // is matched
            start := i
            i++ // move i to next unread char
            for i < len(s) && isIdentChar(rune(s[i])) {
                i++
            }
            // invariant: !(i < len(s) && isIdentChar(rune(s[i])))
            //             i >= len(s) || !isIdentChar(rune(s[i]))
            // i is one more than the index of the last char of the identifier
            str := s[start:i]

            // check if it's a keyword
            kind := IDENT
            _, isKeyWord := keywords[str]
            if isKeyWord {
                kind = KEYWORD
            }
            result = append(result, token{str, kind})
        default:
            return result, fmt.Errorf("unknown character '%c'", s[i])
        } // switch
    } // for
    return result, nil
} // tokenize

func main() {
    // fmt.Println(keywords)
    // s := "package trapCode.x = \"abort\""
    // s := "tm = \"\\\"cat\\\"\""
    // s := "\"a\"=\"b\""
    // s := "a=b"
    s := "a\"b\"c"
    tokens, err := tokenize(s)
    fmt.Printf("s=%s\n%v\nerr=%v\n", s, tokens, err)
} // main
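
For reference, running the program on the default test string should print something like this (recall that IDENT is 1 and STR_LIT is 2 in the const block above):

    s=a"b"c
    [{a 1} {b 2} {c 1}]
    err=<nil>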