To solve the lexical analysis subproblem, a Java processor must examine each character of the input text, recognizing character sequences as tokens, comments, or white space. Regular expressions are used to classify these sequences.
Once a character sequence has been classified, the sequence defined by the regular expression may be extended or shortened by an auxiliary scanner. An auxiliary scanner is associated with a regular expression by specifying its name, enclosed in parentheses (e.g. (auxNewLine)).
Identifiers and denotations must be retained for further processing. This is done in a uniform way by retaining one copy of each distinct input string appearing as an identifier or a denotation. Each identifier or denotation is represented internally by the index of its string in the string memory. If i is this index, then StringTable(i) is a pointer to the (null-terminated) string.
A token processor can be associated with each regular expression by specifying its name, enclosed in brackets (e.g. [mkidn]). The token processor is a C routine whose purpose is to construct an integer-valued internal representation of the scanned string. Every token processor obeys the following interface:
A type-gla file specifies the lexical analysis subproblem:
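A sketch of what such a specification might contain, combining the notation introduced above: a regular expression, optionally followed by an auxiliary scanner name in parentheses and a token processor name in brackets. The pattern details, the auxEOL auxiliary scanner, and the mkint processor are illustrative assumptions; only mkidn and the bracket/parenthesis conventions come from the text above.

```
Identifier: $[a-zA-Z_][a-zA-Z0-9_]*   [mkidn]
Integer:    $[0-9]+                   [mkint]
Comment:    $"//"  (auxEOL)
```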