Lexical Analysis
This chapter discusses the generated lexical analyzer module,
its interface,
and its relationship to other modules in the generated processor.
An understanding of the material here is not necessary for normal use of
the lexical analyzer.
There are some special circumstances in which it is necessary to change the
interactions between the lexical analyzer and its environment.
For example, there is a mismatch between the lexical analyzer and
the source code input module of a FORTRAN 90 compiler:
the unit of input text dealt with by the source code module is the line,
the unit dealt with by the lexical analyzer is the statement, and there is
no fixed relationship between lines and statements.
One line may contain many statements, or one statement may be spread over
many lines.
This mismatch problem is solved by requiring the two modules to interact via
a buffer, and managing that buffer so that it contains both
an integral number of lines
and an integral number of statements.
Because the lexical analyzer normally works directly in the source module's
buffer, that solution requires a change in the relationship between
the lexical analyzer and its environment.
The interaction between the lexical analyzer and its environment is governed
by the following interface:
#include "gla.h"
/* Entities exported by the lexical analyzer module
* NORETURN (constant) Classification of a comment
* ResetScan (variable) Flag causing scan pointer reset
* TokenStart (variable) Address of first classified character
* TokenEnd (variable) Address of first unclassified character
* StartLine (variable) Column index = (TokenEnd - StartLine)
* glalex (operation) Classify the next character sequence
***/
There are three distinct aspects of the relationship between
the lexical analyzer and its environment,
and each is dealt with in one section of this chapter.
First we consider how the lexical analyzer selects the character sequence to be
scanned, then we see how the lexical analyzer's attention can be switched, and
finally how the classification results are reported.
There is no internal storage for text in the lexical analyzer module.
Instead, TokenEnd is set to point to arbitrary text storage.
(Normally the pointer is into the source buffer;
see Text Input of Library Reference Manual.)
The text pointed to may be an arbitrary sequence of characters,
but the last character must be an ASCII NUL.
At the beginning of a scan, TokenEnd points to the beginning of
the string from which a sequence is to be classified.
The lexical analyzer tests that string against its set of regular expressions,
finding the longest sequence that begins with the first character and
matches one of the regular expressions.
If the regular expression matched is associated with an auxiliary scanner
then that auxiliary scanner is invoked with the matched sequence
(see Building scanners).
The auxiliary scanner returns a pointer to the first character that
should not be considered part of the character sequence being matched,
and that pointer becomes the value of TokenEnd.
TokenStart is set to point to the first character of the string.
When no initial character sequence matches any of the regular
expressions, an error report is issued, TokenEnd is advanced by one
position (thus discarding the first character of the string), and the
process is restarted.
If the string is initially empty, no attempt is made to match any regular
expressions.
Instead, the auxiliary scanner auxNUL is invoked immediately.
If this auxiliary scanner returns a pointer to an empty string then the
auxiliary scanner auxEOF is invoked immediately.
Finally, if auxEOF returns a pointer to an empty string then the
token processor EndOfText is invoked immediately.
(If either auxNUL or auxEOF returns a pointer to a non-empty
string, scanning begins on this string as though TokenEnd had
pointed to it initially.)
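The following is a conceptual sketch of that cascade, not the code of the
distributed analyzer; it assumes the auxiliary scanner convention
char *aux(char *start, int length) and the token processor convention
void proc(char *start, int length, int *syncode, int *intrinsic) used
elsewhere in this chapter:
#include "gla.h"
/* Conceptual sketch (not the distributed code) of the
 * empty-string cascade described above.
 */
extern char *auxNUL(), *auxEOF();
extern void EndOfText();
static void
EmptyStringCascade(int *syncode, int *intrinsic)
{ char *p = auxNUL(TokenEnd, 0);        /* try to obtain more text    */
  if (*p == '\0') p = auxEOF(p, 0);     /* none: end-of-file scanner  */
  if (*p == '\0')                       /* still none: end of text    */
    EndOfText(p, 0, syncode, intrinsic);
  else TokenEnd = p;                    /* scan newly-available text  */
}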
TokenStart addresses a sequence of length TokenEnd-TokenStart
when a token processor is invoked
(see Token Processors).
Because TokenStart and TokenEnd are exported variables, the
token processor may change them if that is appropriate.
All memory locations below the location pointed to by TokenStart
are undefined in the fullest sense of the word:
Their contents are unknown, and they may not even exist.
Memory locations beginning with the one pointed to by TokenStart,
up to but not including the one pointed to by TokenEnd, are known to
contain a sequence of non-NUL characters.
TokenEnd points to a sequence of characters, the last of which is an
ASCII NUL.
If the token processor modifies TokenStart or
TokenEnd, it must ensure that these conditions hold after the
modification.
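For example, a token processor could exclude trailing blanks from a
matched sequence by backing up TokenEnd, preserving the conditions just
stated (TrimTrailing is a hypothetical processor, not part of the library):
#include "gla.h"
/* Hypothetical token processor: exclude trailing blanks from the
 * classified sequence by backing up TokenEnd.  The conditions above
 * still hold: the locations from TokenStart up to TokenEnd contain
 * only non-NUL characters, and TokenEnd still addresses a
 * NUL-terminated sequence.
 */
void
TrimTrailing(char *start, int length, int *syncode, int *intrinsic)
{ while (length > 0 && start[length - 1] == ' ') length--;
  TokenEnd = start + length;    /* excluded blanks will be re-scanned */
}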
If the exported variable ResetScan is non-zero when the operation
glalex is invoked, the lexical analyzer's first action is to execute the
macro SCANPTR.
SCANPTR guarantees that TokenEnd addresses the string to be
scanned.
If ResetScan is zero when glalex is invoked, TokenEnd
is assumed to address that string already.
ResetScan is statically initialized to 1, meaning that
SCANPTR will be executed on the first invocation of glalex.
In the distributed system, SCANPTR sets TokenEnd to point to
the first character of the source module's text buffer.
Since this is also the first character of a line, StartLine must
also be set (see Maintaining the Source Text Coordinates):
#define SCANPTR { TokenEnd = TEXTSTART; StartLine = TokenEnd - 1; }
See Text Input of Library Reference Manual.
This implementation can be changed by supplying a file `scanops.h',
containing a new definition of SCANPTR,
as one of your specification files.
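For example, the FORTRAN 90 solution sketched at the beginning of this
chapter might scan from a separate statement buffer.
Here is one possible `scanops.h'; StmtBuffer and NextStatement are
hypothetical names for user code that manages that buffer:
/* Hypothetical scanops.h: scan from a user-managed statement buffer
 * instead of the source module's text buffer.  NextStatement is
 * assumed to leave a NUL-terminated statement in StmtBuffer.
 */
extern char *StmtBuffer;
extern void NextStatement();
#define SCANPTR { NextStatement(); \
  TokenEnd = StmtBuffer; StartLine = TokenEnd - 1; }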
ResetScan is set to zero after SCANPTR has been executed.
Normally, it will never again have the value 1.
Thus SCANPTR will not be executed on any subsequent invocation of
glalex.
Periodic refilling of the source module's text buffer and the associated
resetting of TokenEnd are handled by auxNUL
when the lexical analyzer detects that the string is exhausted.
More complex behavior, using ResetScan to force resets at arbitrary
points, is always possible via token processors or other clients.
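For example, a token processor that arranges for new text to be scanned
could force such a reset (SwitchText is a hypothetical name):
#include "gla.h"
/* Hypothetical token processor: after acting on the matched sequence,
 * force the next invocation of glalex to execute SCANPTR, thereby
 * re-establishing the scan position from scratch.
 */
void
SwitchText(char *start, int length, int *syncode, int *intrinsic)
{ /* ... arrange for the new text to be available ... */
  ResetScan = 1;             /* SCANPTR runs on the next glalex call */
}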
TokenEnd is statically initialized to 0.
Once scanning has begun, TokenEnd should always point to a location
in the source buffer (see Text Input of Library Reference Manual).
Thus SCANPTR can normally distinguish between initialization and
arbitrary resetting by testing TokenEnd.
(If user code sets TokenEnd to 0, of course, this test may
not be valid.)
The classification operation glalex is invoked with a pointer to an
integer variable, which may be set to a value
representing the classified sequence.
An integer result specifying the classification is returned by
glalex, and the coordinates of the first character of the sequence
are stored in the error module's exported variable curpos
(see Source Text Coordinates and Error Reporting of Library Reference Manual).
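A minimal sketch of a client of this interface might therefore look as
follows; EOFCODE is a hypothetical stand-in for whatever classification
the client treats as end of text:
#include <stdio.h>
#include "err.h"            /* curpos, LineOf, ColOf */
#include "gla.h"            /* glalex */
#define EOFCODE 0           /* hypothetical end-of-text classification */
/* Minimal sketch of a glalex client that lists every token. */
void
ListTokens(void)
{ int intrinsic, code;
  while ((code = glalex(&intrinsic)) != EOFCODE)
    printf("code %d at line %d, column %d\n",
      code, LineOf(curpos), ColOf(curpos));
}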
There are three points at which these interactions can be altered:
- Setting coordinate values
- Deciding on a continuation after a classification
- Returning a classification
All of these alterations are made by supplying macro definitions in a
specification file called `scanops.h'.
The remainder of this section defines the macro interfaces and gives the
default implementations.
The coordinates of the first character of a sequence are set by the macro
SETCOORD.
Its default implementation uses the standard coordinate invariant
(see Maintaining the Source Text Coordinates):
/* Set the coordinates of the current token
* On entry-
* LineNum=index of the current line in the entire source text
* p=index of the current column in the entire source line
* On exit-
* curpos has been updated to contain the current position as its
* left coordinate
*/
#define SETCOORD(p) { LineOf(curpos) = LineNum; ColOf(curpos) = (p); }
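This default could be overridden in `scanops.h'; for example, a
hypothetical redefinition reporting zero-based column indexes instead of
the one-based indexes produced by the standard invariant:
/* Hypothetical scanops.h: report zero-based column indexes by
 * shifting the position computed under the standard invariant.
 */
#define SETCOORD(p) { LineOf(curpos) = LineNum; ColOf(curpos) = (p) - 1; }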
When execution monitoring
(see Monitoring of Monitoring) is in effect, more care
must be taken.
In addition to the above, SETCOORD must also set the cumulative
column position, which is the column position within the overall input
stream (as opposed to just the current input file).
Ordinarily the two column positions will be the same, so the default
implementation of SETCOORD for monitoring is:
#define SETCOORD(p) { LineOf(curpos) = LineNum; \
ColOf(curpos) = CumColOf(curpos) = (p); }
When monitoring, it is also necessary to set the coordinates of the
first character beyond the sequence.
This is handled by the macro SETENDCOORD :
/* Set the coordinates of the end of the current token
* On entry-
* LineNum=index of the current line in the entire source text
* p=index of the current column in the entire source line
* On exit-
* curpos has been updated to contain the current position as its
* right coordinate
*/
#ifndef SETENDCOORD
#define SETENDCOORD(p) { RLineOf(curpos) = LineNum; \
RColOf(curpos) = RCumColOf(curpos) = (p); }
#endif
Classification is complete after the regular expression has been matched,
any specified auxiliary scanner invoked, and any specified token processor
invoked.
At this point, one of three distinct actions is possible:
RETURN(v)
- Terminate the invocation of glalex, returning the value v as
the classification.
goto rescan
- Start a new scan at the character addressed by TokenEnd,
without changing the coordinate value.
continue
- Start a new scan at the character addressed by TokenEnd,
resetting the coordinates to the coordinates of that character.
WRAPUP is the macro responsible for deciding among these
possibilities.
When it is executed, TokenEnd addresses the first character beyond
the classified sequence and extcode holds the classification code.
Here is the default implementation:
#define WRAPUP { if (extcode != NORETURN) RETURN(extcode); }
If WRAPUP does not transfer control, the result is the
continue action.
Thus the default implementation of WRAPUP terminates the invocation
of glalex if the current character sequence is not classified as a
comment (extcode != NORETURN), and starts a new scan at the next
character if the current character sequence is classified as a comment.
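A `scanops.h' redefinition can implement other policies.
For example, this hypothetical WRAPUP counts the comments that the
default behavior silently discards, then continues as before:
/* Hypothetical scanops.h: count discarded comments.  CommentCount is
 * assumed to be defined in other user code.  RETURN transfers control,
 * so the count is incremented only for comments.
 */
extern int CommentCount;
#define WRAPUP { \
  if (extcode != NORETURN) RETURN(extcode); \
  CommentCount++; \
}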
If execution monitoring is in effect, the classification event must be
reported in addition to selecting a continuation:
#define WRAPUPMONITOR { \
if (extcode != NORETURN) { \
char save = *TokenEnd; \
*TokenEnd = '\0'; \
generate_token("token", LineOf(curpos), ColOf(curpos), \
CumColOf(curpos), RLineOf(curpos), RColOf(curpos), \
RCumColOf(curpos), TokenStart, TokenEnd - TokenStart, \
*v, extcode); \
*TokenEnd = save; \
} \
}
WRAPUPMONITOR is invoked instead of WRAPUP if execution
monitoring is in effect.
Once the decision has been made to terminate the glalex operation
and report the classification, it is possible to carry out arbitrary
operations in addition to returning the classification code.
For example, execution monitoring requires that this event be reported.
Here is the default implementation:
#ifdef MONITOR
#define RETURN(v) { generate_leave("lexical"); return v; }
#else
#define RETURN(v) { return v; }
#endif
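For example, a hypothetical `scanops.h' could trace each classification
as it is reported (assuming execution monitoring is not in effect):
#include <stdio.h>
/* Hypothetical scanops.h: trace every classification on stderr
 * before returning it to the caller of glalex.
 */
#define RETURN(v) { \
  fprintf(stderr, "classified %d at line %d\n", (v), LineOf(curpos)); \
  return v; \
}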
Recognition of Occam2 block structure from indentation is an example of
how a token processor might use the lexical analyzer interface
(see Using Literal Symbols to Represent Other Things).
The token processor OccamIndent is invoked after a newline character
(possibly followed by spaces and/or tabs) has been recognized:
#include "err.h"
#include "gla.h"
#include "source.h"
#include "litcode.h"
extern char *auxNUL();
extern char *coordAdjust();
#define MAXNEST 50
static int IndentStack[MAXNEST] = {1};
static int *Current = IndentStack;
void
OccamIndent(char *start, int length, int *syncode, int *intrinsic)
{ if (start[length] == '\0') {          /* buffer exhausted: refill   */
    start = auxNUL(start, length);
    if (start[length] != '\0') {        /* more text arrived: re-scan */
      TokenEnd = start; return;
    }
    TokenEnd = start + length;          /* no more text: end of file  */
  }
  if (*TokenEnd == '\0' && Current == IndentStack) return;
  { char *OldStart = StartLine;
    int OldLine = LineNum, Position;
    (void)coordAdjust(start, length); Position = TokenEnd - StartLine;
    if (*Current == Position) *syncode = Separate;
    else if (*Current < Position) {     /* indentation: open a block  */
      *syncode = Initiate;
      if (Current >= IndentStack + MAXNEST - 1)
        message(DEADLY, "Nesting depth exceeded", 0, &curpos);
      *++Current = Position;
    } else {                            /* exdentation: close a block */
      *syncode = Terminate; Current--;
      LineNum = OldLine; StartLine = OldStart; TokenEnd = start;
    }
  }
}
Since the source buffer is guaranteed only to hold an integral number of
lines (see Text Input of Library Reference Manual),
OccamIndent must first refill the buffer if necessary.
The library routine auxNUL carries out this task, returning a
pointer to the character sequence passed to it
(see Available scanners).
Remember that the character sequence may be moved in the process of refilling
the buffer, and therefore it is vital to reset both start and
TokenEnd after the operation.
If auxNUL is invoked and adds characters to the buffer, then those
characters might be white space that should have been part of the original
pattern.
In this case OccamIndent can return, having set TokenEnd to
point to the first character of the original sequence.
Since the sequence was initially classified as a comment (because the
specification did not begin with an identifier followed by a colon,
see Using Literal Symbols to Represent Other Things), the
overall effect will be to re-scan the newline and the text now following
it.
If auxNUL is invoked but does not add characters to the buffer, then
the newline originally matched is the last character of the file.
TokenEnd should be set to point to the character following the
newline.
When the end of the file has been reached, and no blocks remain
unterminated, then the newline character has no meaning.
By returning under these conditions, OccamIndent classifies the
newline as a comment.
Otherwise, the character sequence matched by the pattern must be
interpreted on the basis of the indentation it represents.
Because a single character sequence may terminate any number of blocks, it
may be necessary to interpret it as a sequence of terminators.
The easiest way to do this is to keep re-scanning the same sequence,
returning one terminator each time, until all of the relevant blocks have
been terminated.
In order to make that possible, OccamIndent must save the current
values of the pointer from which column indexes are determined
(StartLine) and the cumulative line number (LineNum).
The pattern with which OccamIndent is associated will match a
character sequence beginning with a newline and containing an arbitrary
sequence of spaces and tabs.
To determine the column index of the first character following this
sequence, apply coordAdjust to it (see Available scanners).
That auxiliary scanner leaves the character sequence unchanged, but
re-establishes the invariant on LineNum and StartLine
(see Maintaining the Source Text Coordinates).
After the invariant is re-established, the column index can be computed.
Current points to the element of IndentStack containing the
column index of the first character of a line belonging to the current block.
(If no block has been opened, the value is 1.)
When the column index of the character following the initial white space
is equal to this value, that white space should be classified as a
separator.
Otherwise, if the column index shows an indentation then the white space
should be classified as an initiator and the new column position should be
pushed onto the stack.
Stack overflow is a deadly error, making further processing impossible
(see Source Text Coordinates and Error Reporting of Library Reference Manual).
Finally, if the column index shows an exdentation then the white space
should be classified as a terminator and the column position for the
terminated block deleted from the stack.
When a newline terminates a block, it must be re-scanned and interpreted in
the context of the text surrounding the terminated block.
Therefore in this case StartLine and LineNum are restored to
the values they had before coordAdjust was invoked, and
TokenEnd is set to point to the newline character at the start of
the sequence.
Thus the next invocation of the lexical analyzer will again recognize the
sequence and invoke OccamIndent to interpret it.