Lexical Analysis

The Generated Lexical Analyzer Module

This chapter discusses the generated lexical analyzer module, its interface, and its relationship to other modules in the generated processor. An understanding of the material here is not necessary for normal use of the lexical analyzer.

There are some special circumstances in which it is necessary to change the interactions between the lexical analyzer and its environment. For example, there is a mismatch between the lexical analyzer and the source code input module of a FORTRAN 90 compiler: The unit of input text dealt with by the source code module is the line, the unit dealt with by the lexical analyzer is the statement, and there is no relationship between lines and statements. One line may contain many statements, or one statement may be spread over many lines. This mismatch problem is solved by requiring the two modules to interact via a buffer, and managing that buffer so that it contains both an integral number of lines and an integral number of statements. Because the lexical analyzer normally works directly in the source module's buffer, that solution requires a change in the relationship between the lexical analyzer and its environment.

The interaction between the lexical analyzer and its environment is governed by the following interface:

#include "gla.h"
/* Entities exported by the lexical analyzer module
 * NORETURN	(constant)	Classification of a comment
 * ResetScan	(variable)	Flag causing scan pointer reset
 * TokenStart	(variable)	Address of first classified character
 * TokenEnd	(variable)	Address of first unclassified character
 * StartLine	(variable)	Column index = (TokenEnd - StartLine)
 * glalex	(operation)	Classify the next character sequence
 ***/

There are three distinct aspects of the relationship between the lexical analyzer and its environment, and each is dealt with in one section of this chapter. First we consider how the lexical analyzer selects the character sequence to be scanned, then we see how the lexical analyzer's attention can be switched, and finally how the classification results are reported.

Interaction Between the Lexical Analyzer and the Text

There is no internal storage for text in the lexical analyzer module. Instead, TokenEnd is set to point to arbitrary text storage. (Normally the pointer is to the source buffer, see Text Input of Library Reference Manual.) The text pointed to may be an arbitrary sequence of characters, but its last character must be an ASCII NUL.

At the beginning of a scan, TokenEnd points to the beginning of the string on which a sequence is to be classified. The lexical analyzer tests that string against its set of regular expressions, finding the longest sequence that begins with the first character and matches one of the regular expressions.

If the regular expression matched is associated with an auxiliary scanner then that auxiliary scanner is invoked with the matched sequence (see Building scanners). The auxiliary scanner returns a pointer to the first character that should not be considered part of the character sequence being matched, and that pointer becomes the value of TokenEnd. TokenStart is set to point to the first character of the string.
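
As an illustration of this contract, here is a minimal sketch of an auxiliary scanner that extends a match to the end of the current line. The name auxToEOL is an assumption for this example, not a scanner provided by the Eli library:

char *                            /* hypothetical auxiliary scanner */
auxToEOL(char *start, int length)
{ char *p = start + length;       /* first character after the match */
  /* Extend the sequence up to, but not including, the next newline;
   * stop at the buffer's terminating NUL as well. */
  while (*p != '\n' && *p != '\0') p++;
  return p;  /* becomes TokenEnd: first character not in the sequence */
}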

When no initial character sequence matches any of the regular expressions an error report is issued, TokenEnd is advanced by one position (thus discarding the first character of the string), and the process is restarted. If the string is initially empty, no attempt is made to match any regular expressions. Instead, the auxiliary scanner auxNUL is invoked immediately. If this auxiliary scanner returns a pointer to an empty string then the auxiliary scanner auxEOF is invoked immediately. Finally, if auxEOF returns a pointer to an empty string then the token processor EndOfText is invoked immediately. (If either auxNUL or auxEOF returns a pointer to a non-empty string, scanning begins on this string as though TokenEnd had pointed to it initially.)

TokenStart addresses a sequence of length TokenEnd-TokenStart when a token processor is invoked (see Token Processors). Because TokenStart and TokenEnd are exported variables, the token processor may change them if that is appropriate. All memory locations below the location pointed to by TokenStart are undefined in the fullest sense of the word: Their contents are unknown, and they may not even exist. Memory locations beginning with the one pointed to by TokenStart, up to but not including the one pointed to by TokenEnd, are known to contain a sequence of non-NUL characters. TokenEnd points to a sequence of characters, the last of which is an ASCII NUL. If the token processor modifies the contents of TokenStart or TokenEnd, it must ensure that these conditions hold after the modification.
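
For example, here is a minimal sketch of a token processor that computes the value of a decimal integer token while leaving TokenStart, TokenEnd, and the classification untouched. The name MkDecimal is hypothetical, and the pattern is assumed to guarantee that all matched characters are digits:

#include "gla.h"

void                              /* hypothetical token processor */
MkDecimal(char *start, int length, int *syncode, int *intrinsic)
{ int value = 0, i;
  /* start..start+length-1 are the matched, non-NUL characters */
  for (i = 0; i < length; i++) value = value * 10 + (start[i] - '0');
  *intrinsic = value;  /* delivered through glalex's pointer argument */
  /* *syncode is left unchanged, so the classification stands */
}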

Resetting the Scan Pointer

If the exported variable ResetScan is non-zero when the operation glalex is invoked, the lexical analyzer's first action is to execute the macro SCANPTR. SCANPTR guarantees that TokenEnd addresses the string to be scanned. If ResetScan is zero when glalex is invoked, TokenEnd is assumed to address that string already. ResetScan is statically initialized to 1, meaning that SCANPTR will be executed on the first invocation of glalex.

In the distributed system, SCANPTR sets TokenEnd to point to the first character of the source module's text buffer. Since this is also the first character of a line, StartLine must also be set (see Maintaining the Source Text Coordinates):

#define SCANPTR { TokenEnd = TEXTSTART; StartLine = TokenEnd - 1; }

See Text Input of Library Reference Manual. This implementation can be changed by supplying a file `scanops.h', containing a new definition of SCANPTR, as one of your specification files.

ResetScan is set to zero after SCANPTR has been executed. Normally, it will never again have the value 1. Thus SCANPTR will not be executed on any subsequent invocation of glalex. Periodic refilling of the source module's text buffer and associated re-setting of TokenEnd is handled by auxNUL when the lexical analyzer detects that the string is exhausted. More complex behavior, using ResetScan to force resets at arbitrary points, is always possible via token processors or other clients.

TokenEnd is statically initialized to 0. Once scanning has begun, TokenEnd should always point to a location in the source buffer (see Text Input of Library Reference Manual). Thus SCANPTR can normally distinguish between initialization and arbitrary re-setting by testing TokenEnd. (If user code sets TokenEnd to 0, of course, this test may not be valid.)
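
As a sketch of how that test might be used, a replacement `scanops.h' for a hypothetical custom input module could define SCANPTR as follows. MyBuffer is an assumed name, not an Eli entity:

/* scanops.h -- hypothetical replacement for the default SCANPTR */
extern char *MyBuffer;  /* assumed: NUL-terminated text from a custom module */
#define SCANPTR { \
  if (TokenEnd == (char *)0) TokenEnd = MyBuffer; /* first invocation */ \
  /* otherwise a client has already set TokenEnd to the desired point */ \
  StartLine = TokenEnd - 1;  /* re-establish the column invariant */ \
}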

The Classification Operation

The classification operation glalex is invoked with a pointer to an integer variable that may be set to the value representing the classified sequence. An integer result specifying the classification is returned by glalex, and the coordinates of the first character of the sequence are stored in the error module's exported variable curpos (see Source Text Coordinates and Error Reporting of Library Reference Manual).
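
In a generated processor, glalex is normally invoked by the generated parser. The following sketch shows the calling convention in isolation; EOFCODE is an assumption standing for whatever terminal code your grammar assigns to the end of the text, and is not defined by Eli:

#include <stdio.h>
#include "gla.h"

#define EOFCODE 0  /* assumption: substitute your grammar's end-of-text code */

void DumpTokens(void)
{ int intrinsic, code;
  for (;;) {
    intrinsic = 0;
    code = glalex(&intrinsic);   /* classify the next character sequence */
    printf("class %d value %d\n", code, intrinsic);
    if (code == EOFCODE) break;
  }
}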

There are three points at which these interactions can be altered:

  1. Setting coordinate values
  2. Deciding on a continuation after a classification
  3. Returning a classification

All of these alterations are made by supplying macro definitions in a specification file called `scanops.h'. The remainder of this section defines the macro interfaces and gives the default implementations.

Setting coordinate values

The coordinates of the first character of a sequence are set by the macro SETCOORD. Its default implementation uses the standard coordinate invariant (see Maintaining the Source Text Coordinates):

/* Set the coordinates of the current token
 *   On entry-
 *     LineNum=index of the current line in the entire source text
 *     p=index of the current column in the entire source line
 *   On exit-
 *     curpos has been updated to contain the current position as its
 *     left coordinate
 */
#define SETCOORD(p) { LineOf(curpos) = LineNum; ColOf(curpos) = (p); }
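
If, for example, you wanted to trace token positions while debugging, a replacement in `scanops.h' could extend the default implementation. The tracing shown here is an illustration, not a library facility:

/* scanops.h -- hypothetical SETCOORD that also traces each token's position */
#include <stdio.h>
#define SETCOORD(p) { LineOf(curpos) = LineNum; ColOf(curpos) = (p); \
		      fprintf(stderr, "token at line %d column %d\n", \
			      LineNum, (p)); }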

When execution monitoring (see Monitoring of Monitoring) is in effect, more care must be taken. In addition to setting the left coordinate as the default implementation does, SETCOORD must also set the cumulative column position, which is the column position within the overall input stream (as opposed to just the current input file). Ordinarily the two column positions will be the same, so the default implementation of SETCOORD for monitoring is:

#define SETCOORD(p) { LineOf(curpos) = LineNum; \
		      ColOf(curpos) = CumColOf(curpos) = (p); }

When monitoring, it is also necessary to set the coordinates of the first character beyond the sequence. This is handled by the macro SETENDCOORD:

/* Set the coordinates of the end of the current token
 *   On entry-
 *     LineNum=index of the current line in the entire source text
 *     p=index of the current column in the entire source line
 *   On exit-
 *     curpos has been updated to contain the current position as its
 *     right coordinate
 */
#ifndef SETENDCOORD
#define SETENDCOORD(p) { RLineOf(curpos) = LineNum; \
			 RColOf(curpos) = RCumColOf(curpos) = (p); }
#endif

Deciding on a continuation after a classification

Classification is complete after the regular expression has been matched, any specified auxiliary scanner invoked, and any specified token processor invoked. At this point, one of three distinct actions is possible:

RETURN(v)
Terminate the invocation of glalex, returning the value v as the classification.
goto rescan
Start a new scan at the character addressed by TokenEnd, without changing the coordinate value.
continue
Start a new scan at the character addressed by TokenEnd, resetting the coordinates to the coordinates of that character.

WRAPUP is the macro responsible for deciding among these possibilities. When it is executed, TokenEnd addresses the first character beyond the classified sequence and extcode holds the classification code. Here is the default implementation:

#define WRAPUP { if (extcode != NORETURN) RETURN(extcode); }

If WRAPUP does not transfer control, the result is the continue action. Thus the default implementation of WRAPUP terminates the invocation of glalex if the current character sequence is not classified as a comment (extcode != NORETURN), and starts a new scan at the next character if the current character sequence is classified as a comment.
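
A replacement WRAPUP in `scanops.h' can add behavior before choosing a continuation. For instance, this sketch counts the comments that are discarded; CommentCount is an assumed variable, not part of the generated module:

/* scanops.h -- hypothetical WRAPUP that counts discarded comments */
extern int CommentCount;
#define WRAPUP { \
  if (extcode != NORETURN) RETURN(extcode); \
  CommentCount++;  /* control falls through: the continue action */ \
}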

If execution monitoring is in effect, the classification event must be reported in addition to selecting a continuation:

#define WRAPUPMONITOR { \
  if (extcode != NORETURN) { \
    char save = *TokenEnd; \
    *TokenEnd = '\0'; \
    generate_token("token", LineOf(curpos), ColOf(curpos), \
		    CumColOf(curpos), RLineOf(curpos), RColOf(curpos), \
		    RCumColOf(curpos), TokenStart, TokenEnd - TokenStart, \
		    *v, extcode); \
    *TokenEnd = save; \
  } \
}

WRAPUPMONITOR is invoked instead of WRAPUP if execution monitoring is in effect.

Returning a classification

Once the decision has been made to terminate the glalex operation and report the classification, it is possible to carry out arbitrary operations in addition to returning the classification code. For example, execution monitoring requires that this event be reported. Here is the default implementation:

#ifdef MONITOR
#define RETURN(v) { generate_leave("lexical"); return v; }
#else
#define RETURN(v) { return v; }
#endif
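
A replacement RETURN in `scanops.h' can perform similar bookkeeping. This sketch counts the classified sequences that are reported; TokenCount is an assumed variable:

/* scanops.h -- hypothetical RETURN that counts reported classifications */
extern int TokenCount;
#define RETURN(v) { TokenCount++; return v; }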

An Example of Interface Usage

Recognition of Occam2 block structure from indentation is an example of how a token processor might use the lexical analyzer interface (see Using Literal Symbols to Represent Other Things). The token processor OccamIndent is invoked after a newline character (possibly followed by spaces and/or tabs) has been recognized:

#include "err.h"
#include "gla.h"
#include "source.h"
#include "litcode.h"

extern char *auxNUL();
extern char *coordAdjust();

#define MAXNEST 50
static int IndentStack[MAXNEST] = {1};  /* column index of each open block */
static int *Current = IndentStack;      /* top of the indentation stack */

void
OccamIndent(char *start, int length, int *syncode, int *intrinsic)
{ /* Refill the source buffer if the match ended at its terminating NUL */
  if (start[length] == '\0') {
    start = auxNUL(start, length);
    if (start[length] != '\0') { TokenEnd = start; return; }
    TokenEnd = start + length;
  }

  /* End of file with no open blocks: classify the newline as a comment */
  if (*TokenEnd == '\0' && Current == IndentStack) return;

  { char *OldStart = StartLine;
    int OldLine = LineNum, Position;

    (void)coordAdjust(start, length); Position = TokenEnd - StartLine;
    if (*Current == Position) *syncode = Separate;
    else if (*Current < Position) {
      *syncode = Initiate;
      if (Current == IndentStack + MAXNEST - 1)  /* next push would overflow */
        message(DEADLY, "Nesting depth exceeded", 0, &curpos);
      *++Current = Position;
    } else {
      /* Exdentation: report one terminator, then arrange a re-scan */
      *syncode = Terminate; Current--;
      LineNum = OldLine; StartLine = OldStart; TokenEnd = start;
    }
  }
}

Since the source buffer is guaranteed only to hold an integral number of lines (see Text Input of Library Reference Manual), OccamIndent must first refill the buffer if necessary. The library routine auxNUL carries out this task, returning a pointer to the character sequence passed to it (see Available scanners). Remember that the character sequence may be moved in the process of refilling the buffer, and therefore it is vital to reset both start and TokenEnd after the operation.

If auxNUL is invoked and adds characters to the buffer, then those characters might be white space that should have been part of the original pattern. In this case OccamIndent can return, having set TokenEnd to point to the first character of the original sequence. Since the sequence was initially classified as a comment (because the specification did not begin with an identifier followed by a colon, see Using Literal Symbols to Represent Other Things), the overall effect will be to re-scan the newline and the text now following it.

If auxNUL is invoked but does not add characters to the buffer, then the newline originally matched is the last character of the file. TokenEnd should be set to point to the character following the newline.

If the end of the file has been reached and no blocks remain unterminated, the newline character has no meaning. By returning under these conditions, OccamIndent classifies the newline as a comment. Otherwise, the character sequence matched by the pattern must be interpreted on the basis of the indentation it represents.

Because a single character sequence may terminate any number of blocks, it may be necessary to interpret it as a sequence of terminators. The easiest way to do this is to keep re-scanning the same sequence, returning one terminator each time, until all of the relevant blocks have been terminated. In order to make that possible, OccamIndent must save the current values of the pointer from which column indexes are determined (StartLine) and the cumulative line number (LineNum).

The pattern with which OccamIndent is associated will match a character sequence beginning with a newline and containing an arbitrary sequence of spaces and tabs. To determine the column index of the first character following this sequence, apply coordAdjust to it (see Available scanners). That auxiliary scanner leaves the character sequence unchanged, but re-establishes the invariant on LineNum and StartLine (see Maintaining the Source Text Coordinates). After the invariant is re-established, the column index can be computed.

Current points to the element of IndentStack containing the column index of the first character of a line belonging to the current block. (If no block has been opened, the value is 1.) When the column index of the character following the initial white space is equal to this value, that white space should be classified as a separator. Otherwise, if the column index shows an indentation then the white space should be classified as an initiator and the new column position should be pushed onto the stack. Stack overflow is a deadly error, making further processing impossible (see Source Text Coordinates and Error Reporting of Library Reference Manual). Finally, if the column index shows an exdentation then the white space should be classified as a terminator and the column position for the terminated block deleted from the stack.

When a newline terminates a block, it must be re-scanned and interpreted in the context of the text surrounding the terminated block. Therefore in this case StartLine and LineNum are restored to the values they had before coordAdjust was invoked, and TokenEnd is set to point to the newline character at the start of the sequence. Thus the next invocation of the lexical analyzer will again recognize the sequence and invoke OccamIndent to interpret it.

