Next: Unicode Up: An Analyzer for Java Previous: An Analyzer for Java

Lexical Structure

To solve the lexical analysis subproblem, a Java processor must examine each character of the input text, recognizing character sequences as tokens, comments or white space. Regular expressions are used to classify these sequences.

Once a character sequence has been classified, an auxiliary scanner may extend or shorten the sequence matched by the regular expression. An auxiliary scanner is associated with a regular expression by specifying its name, enclosed in parentheses (e.g. (auxNewLine)).
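As an illustration, an auxiliary scanner that extends a match to the end of the current line might look like the sketch below. The name auxToEolSketch and the (start, length) signature returning a pointer just past the accepted sequence are assumptions made for this example; they are not definitions taken from this document.

```c
/* Hypothetical auxiliary scanner (name and signature are assumptions).
 * On entry: start points to the first character matched by the regular
 *           expression and length is the number of characters matched.
 * On exit:  the result points to the first character that is NOT part
 *           of the sequence, so the match may have been extended.
 */
char *auxToEolSketch(char *start, int length)
{
    char *p = start + length;          /* first character after the match */

    /* Extend the sequence up to, but not including, the line terminator. */
    while (*p != '\0' && *p != '\n' && *p != '\r')
        p++;
    return p;
}
```

Under these assumptions, a regular expression that matches only the two characters opening an end-of-line comment could leave the rest of the line to the auxiliary scanner. Returning a pointer earlier than start + length would shorten the sequence instead.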

Identifiers and denotations must be retained for further processing. This is done in a uniform way by retaining one copy of each distinct input string appearing as an identifier or a denotation. Each identifier or denotation is represented internally by the index of its string in the string memory. If i is this index, then StringTable(i) is a pointer to the (null-terminated) string.
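The idea of keeping one copy of each distinct string and referring to it by index can be sketched as follows. This is a minimal illustration, not the actual string memory module: the names internSketch and StringTableSketch, the fixed capacity, and the linear search are all assumptions chosen to keep the example short.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_STRINGS 1024               /* arbitrary capacity for the sketch */

static char *strings[MAX_STRINGS];     /* one copy of each distinct string */
static int   nstrings = 0;

/* Hypothetical stand-in for StringTable(i): pointer to string number i. */
const char *StringTableSketch(int i)
{
    return strings[i];
}

/* Return the index of the string c[0..l-1], storing a null-terminated
 * copy only the first time that string is seen. Returns -1 if the table
 * is full or memory cannot be allocated. */
int internSketch(const char *c, int l)
{
    int i;
    char *copy;

    for (i = 0; i < nstrings; i++)     /* linear search keeps the sketch short */
        if (strlen(strings[i]) == (size_t)l &&
            memcmp(strings[i], c, (size_t)l) == 0)
            return i;

    if (nstrings == MAX_STRINGS)
        return -1;
    copy = (char *)malloc((size_t)l + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, c, (size_t)l);
    copy[l] = '\0';
    strings[nstrings] = copy;
    return nstrings++;
}
```

Two occurrences of the same identifier yield the same index, so later phases can compare identifiers by comparing integers. A production implementation would normally replace the linear search with a hash table.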

A token processor can be associated with each regular expression by specifying its name, enclosed in square brackets (e.g. [mkidn]). The token processor is a C routine whose purpose is to construct an integer-valued internal representation of the scanned string. Every token processor obeys the following interface:

Token processor[1] :

#if defined(__cplusplus) || defined(__STDC__)
(char *c, int l, int *t, int *s)
#else
(c, l, t, s) char *c; int l, *t, *s;
#endif
/* On entry-
 *   c points to the first character of the scanned string
 *   l=length of the scanned string
 *   *t=initial classification
 * On exit-
 *   *t=final classification
 *   *s=internal representation
 */
This macro is invoked in definitions 10, 18, 25, and 31.
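To make the interface concrete, here is a sketch of a routine that obeys it by storing the value of a string of decimal digits as the internal representation. The name mkDecimalSketch and the void return type are assumptions made for this example, and the routine is not one of the token processors defined in this specification.

```c
/* Hypothetical token processor obeying the interface above.
 * On entry: c points to the first of l decimal digits and *t holds the
 *           initial classification.
 * On exit:  *t is unchanged and *s holds the numeric value.
 * Overflow is not checked in this sketch.
 */
void mkDecimalSketch(char *c, int l, int *t, int *s)
{
    int value = 0;
    int i;

    /* Use the length l: the scanned string is part of the input text
     * and is not assumed to be null-terminated at position l. */
    for (i = 0; i < l; i++)
        value = value * 10 + (c[i] - '0');

    (void)t;                           /* classification is left as it was */
    *s = value;
}
```

A processor that needs to reclassify the token, for example to distinguish a keyword from an identifier, would assign a new code to *t before returning.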

A type-gla file specifies the lexical analysis subproblem:


This macro is attached to a product file.
