Next: Unicode Up: An Analyzer for Java Previous: An Analyzer for Java

Lexical Structure

To solve the lexical analysis subproblem, a Java processor must examine each character of the input text, recognizing character sequences as tokens, comments or white space. Regular expressions are used to classify these sequences.

Once a character sequence has been classified, the sequence defined by the regular expression may be extended or shortened by an auxiliary scanner. An auxiliary scanner is associated with a regular expression by specifying its name, enclosed in parentheses (e.g. (auxNewLine)).
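As a rough illustration of what "extending" a classified sequence means, the sketch below advances past the matched prefix to the end of the current line. The function name, its signature (start pointer and matched length in, pointer to the first character beyond the final sequence out), and the assumption that the input buffer is null-terminated are all hypothetical; this section does not state the auxiliary scanner interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical auxiliary scanner (name and signature are assumptions).
 * start  points to the first character matched by the regular expression
 * length is the number of characters the regular expression matched
 * Returns a pointer to the first character beyond the extended sequence.
 * Assumes the input buffer is null-terminated. */
char *sketch_aux_to_eol(char *start, int length)
{
    char *p;

    assert(start != NULL && length >= 0);
    p = start + length;
    /* Extend the sequence up to, but not including, the newline. */
    while (*p != '\0' && *p != '\n')
        p++;
    return p;
}
```

A scanner that shortens the sequence would instead return a pointer that lies before start + length.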

Identifiers and denotations must be retained for further processing. This is done in a uniform way by retaining one copy of each distinct input string appearing as an identifier or a denotation. Each identifier or denotation is represented internally by the index of its string in the string memory. If i is this index, then StringTable(i) is a pointer to the (null-terminated) string.
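The retention scheme can be sketched as a small interning table: each distinct string is copied once, and later occurrences yield the same index. Everything here is a hypothetical stand-in (sketch_intern, sketch_string_table, the fixed capacity, and the linear search); it is not the string memory module the specification actually uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_STRINGS 1024   /* hypothetical fixed capacity */

static char *sketch_strings[SKETCH_MAX_STRINGS];
static int sketch_count = 0;

/* Hypothetical stand-in for StringTable(i): pointer to the i-th
 * stored, null-terminated string. */
const char *sketch_string_table(int i)
{
    assert(i >= 0 && i < sketch_count);
    return sketch_strings[i];
}

/* Return the index of the string c[0..l-1]. One null-terminated copy
 * is stored the first time a distinct string is seen. */
int sketch_intern(const char *c, int l)
{
    int i;
    char *copy;

    assert(c != NULL && l >= 0);

    /* Linear search keeps the sketch short. */
    for (i = 0; i < sketch_count; i++) {
        if (strlen(sketch_strings[i]) == (size_t)l &&
            memcmp(sketch_strings[i], c, (size_t)l) == 0)
            return i;
    }

    assert(sketch_count < SKETCH_MAX_STRINGS);
    copy = malloc((size_t)l + 1);
    assert(copy != NULL);
    memcpy(copy, c, (size_t)l);
    copy[l] = '\0';
    sketch_strings[sketch_count] = copy;
    return sketch_count++;
}
```

Because equal strings always map to equal indices, later phases can compare identifiers by comparing integers. A production table would presumably use hashing rather than a linear scan.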

A token processor can be associated with each regular expression by specifying its name, enclosed in brackets (e.g. [mkidn]). The token processor is a C routine, the intent of which is to construct an integer-valued internal representation of the scanned string. Every token processor obeys the following interface:

Token processor[1]:

void
#if PROTO_OK
(char *c, int l, int *t, int *s)
#else
(c, l, t, s) char *c; int l, *t, *s;
#endif
/* On entry-
 *   c points to the first character of the scanned string
 *   l=length of the scanned string
 *   *t=initial classification
 * On exit-
 *   *t=final classification
 *   *s=internal representation
 ***/

This macro is invoked in definitions 10, 18, 25, and 31.
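A routine obeying this interface might look like the following sketch, which computes the value of a decimal literal and reclassifies the token if the value does not fit in an int. The name sketch_mkint and the classification code SKETCH_BAD_TOKEN are hypothetical and do not appear in the specification. Only the prototyped form (the PROTO_OK branch) is shown.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BAD_TOKEN (-1)   /* hypothetical code for a rejected token */

/* Hypothetical token processor.
 * On entry-
 *   c points to the first character of the scanned string
 *   l=length of the scanned string (the string need not be terminated)
 *   *t=initial classification
 * On exit-
 *   *t=unchanged, or SKETCH_BAD_TOKEN if the string is not a decimal
 *      literal that fits in an int
 *   *s=value of the literal, or 0 if the token was rejected */
void sketch_mkint(char *c, int l, int *t, int *s)
{
    int value = 0;
    int i;

    assert(c != NULL && l >= 0 && t != NULL && s != NULL);

    for (i = 0; i < l; i++) {
        int digit = c[i] - '0';

        /* Reject non-digits, and digits that would overflow value. */
        if (digit < 0 || digit > 9 || value > (INT_MAX - digit) / 10) {
            *t = SKETCH_BAD_TOKEN;
            *s = 0;
            return;
        }
        value = value * 10 + digit;
    }
    *s = value;
}
```

The length l bounds the loop because the scanned string is a slice of the input text; characters after it belong to the next token.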

A type-gla file specifies the lexical analysis subproblem:

Phrase.gla[2]:

InputElement[5]

This macro is attached to a product file.



2008-09-11