Lexical Analysis
A specification consists of a regular expression,
possibly the name of an auxiliary scanner,
and possibly the name of a token processor.
Sequences of input characters are classified initially on the basis of the
regular expression they match.
If the line containing the regular expression also contains the name of an
auxiliary scanner, then that scanner is invoked after the regular
expression has been matched.
An auxiliary scanner may lengthen or shorten the character sequence being
classified.
If the line containing the regular expression also contains the name of a token
processor, then that token processor is invoked after any auxiliary scanner.
A token processor may change the initial classification of the sequence,
and may also calculate a value representing the sequence.
Specifications are provided in type-`gla' files whose contents obey
the following phrase structure:
File: ( Specification NewLine )* .
Specification:
[ TokenName ':' ]
Pattern
[ '(' AuxiliaryScannerName ')' ]
[ '[' TokenProcessorName ']' ] .
Pattern: RegularExpression / CannedSpecificationName .
TokenName: Identifier .
CannedSpecificationName: Identifier .
AuxiliaryScannerName: Identifier .
TokenProcessorName: Identifier .
An Identifier is defined as in C, and a type-`gla' file may
contain arbitrary empty lines, C comments and pre-processor directives.
Comments may also be written as character sequences enclosed in braces
({ }) that do not themselves include braces.
The remainder of this chapter explains each of these components of the
description in detail.
A regular expression is a pattern that defines a set of character
sequences:
If the regular expression matches a particular sequence then that sequence
is a member of the set; otherwise it is not a member.
Here is a summary of Eli's regular expression notation:
c
- matches the character c, unless c is space, tab, newline
or one of \ " . [ ] ^ ( ) | ? + * { } / $ <
\c
- matches c (see Matching operator characters)
"s"
- matches the sequence s (see Matching operator characters)
.
- matches any character except newline
[xyz]
- matches exactly one of the characters x, y or z
[^xyz]
- matches exactly one character that is not x, y or z
[c-d]
- matches exactly one of the characters whose ASCII codes lie between the
codes for c and d (inclusive)
(e)
- matches a sequence matched by e
ef
- matches a sequence matched by e followed by a sequence matched by f
e|f
- matches either a sequence matched by e or a sequence matched by f
e?
- matches either an empty sequence or a sequence matched by e
e+
- matches one or more occurrences of a sequence matched by e
e*
- matches either an empty sequence or one or more occurrences of a
sequence matched by e
e{m,n}
- matches a sequence of no fewer than m and no more than n
occurrences of a sequence matched by e
Each of the regular expressions e?, e+, e* and
e{m,n} matches the longest sequence of characters
consonant with its definition.
In a type-`gla' file, each regular expression
is delimited on the left by $ and on the right by white space:
$a57D
$[0-9]+
$[a-zA-Z_][a-zA-Z_0-9]*
The first example matches the single character sequence a57D,
while the second matches a sequence of one or more digits.
The third describes C-style identifiers: an initial letter
or underscore, followed by zero or more alphanumeric characters or underscores.
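The third pattern can be read as a small loop. The following C function is only an illustrative sketch of what the pattern accepts; the name match_identifier is hypothetical and is not part of Eli, and the code assumes a NUL-terminated string in the "C" locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of Eli): returns the number of leading
 * characters of s matched by [a-zA-Z_][a-zA-Z_0-9]*, or 0 if the
 * pattern does not match at all.
 * Assumes s is NUL-terminated and the "C" locale is in effect.
 */
size_t match_identifier(const char *s)
{
    size_t n;

    /* First component: a letter or an underscore */
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    n = 1;

    /* Second component: zero or more letters, digits or underscores,
     * taking as many as possible (the longest match) */
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;

    return n;
}
```

Applied to the text x_1+y the function reports a match of three characters, because + cannot extend the identifier.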
A regular expression consists of text characters
(which match the corresponding characters in the input character sequences)
and operator characters
(which specify repetitions, choices and other features).
The operator characters are the following:
\ " . [ ] ^ ( ) | ? + * { } / $ <
Space, tab, newline and characters appearing in this list are not
text characters; every other character is a text character.
If an operator character is to match an instance of itself in the input
sequence then it must be marked in the regular expression as being a text
character.
This can be done by preceding it with backslash (\).
Any occurrence of an operator character (including backslash) that is
preceded by backslash loses its operator status and
is considered to be a text character.
The text characters space, tab and newline are represented as
\040, \t and \n respectively;
\b represents the text character "backspace".
Any character except the ASCII NUL (code 0)
can also be represented by a backslash, followed by zero, followed by the ASCII
code for the character written as a sequence of up to three octal digits (the
representation of a space character always has this form).
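The octal form described above can be produced mechanically. The following C sketch builds it for a given character; the function name octal_escape is hypothetical and is not part of Eli.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Eli): writes into buf the
 * representation of character c as a backslash, a zero, and the
 * octal code of c.
 * Returns the number of characters written (not counting the
 * terminating NUL), or -1 if c is ASCII NUL (which has no such
 * representation) or if buf is too small.
 */
int octal_escape(unsigned char c, char *buf, size_t size)
{
    int n;

    if (c == 0)
        return -1;
    n = snprintf(buf, size, "\\0%o", (unsigned)c);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

For a space (code 40 octal) this yields \040, and for the tilde (code 176 octal) it yields \0176.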
A sequence of operator characters can be used as a sequence of text characters
by surrounding the sequence with double quote operators ("):
xyz"++"
"xyz++"
Both of these patterns match the string xyz++.
As shown, it is harmless but unnecessary to quote a character that is not an
operator.
Backslash is also effective within a sequence surrounded by double quote
operators, and must be used to mark backslash, quote and white space:
"\t\\\040\"\040.\040[\040]\040^"
This pattern matches an initial segment of the operator character display
at the beginning of this section.
A character class is a pattern that defines a set of characters and
matches exactly one character from that set.
The simplest character class representation is the period (.),
which defines the set of all characters except newline.
Character classes can also be represented using the operator pair [ ].
[Abc] defines the set of three characters A, b and c.
Within square brackets, most operator meanings are ignored.
Only four characters are special: \, -, ^ and ].
In particular, the double quote character (") is not considered
special and therefore cannot be used to surround a sequence of operator
characters.
The \ character provides the usual escapes within character class
brackets.
Thus [[\]] matches either [ or ],
because \ causes the first ] in the character class
representation to be taken as a normal character
rather than the closing bracket of the representation.
The following specification causes an error, however:
[["]"]
The quote is not special in a character class,
so the first ] is the closing bracket of the set.
The second " is therefore outside the definition of the character
class, and is taken as the beginning of a quoted string containing the second ].
Since there is no closing quote for this string, the specification is erroneous.
If the first character after the opening bracket of a character class
is ^,
the set defined by the remainder of the character class is complemented
with respect to the computer's character set.
Using this notation, the character class represented by . can be
described as [^\n].
If ^ appears as any character of a class except the first,
it is not considered to be an operator.
Thus [^abc] matches any character except a, b
or c, but [a^bc] or [abc^] matches a, b,
c or ^.
Within a character class representation, -
can be used to define a set of characters in terms of a range.
For example, a-z defines the set of lower-case letters and
A-Z defines the set of upper-case letters.
The endpoints of a range may be specified in either order
(i.e. both 0-9 and 9-0 define the set of digits).
Ranges can also be defined in terms of specific ASCII codes:
\041-\0176 is the set of all visible ASCII characters.
Using - between any pair of characters that are not
both upper case letters, both lower case letters, or both digits
defines an implementation-dependent set and will generate a warning.
Any number of ranges can be used in the representation of a character
class.
For example, [a-z0-9<>_] will match any lower case letter, digit,
angle bracket or underline while [^a-zA-Z] will match any character
that is not a letter.
If it is desired to include the character - in a character class,
it should either be escaped with \
or it should occupy the first or last position.
Thus [-+0-9] will match +, - or any digit,
as will [0-9\-+].
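The rules for ^, - and \ inside a character class can be summarized as a membership test. The following C function is a simplified sketch, not Eli's implementation: the name in_class is hypothetical, a backslash only makes the next character literal (escapes such as \n or octal codes are not interpreted), and no warning is issued for implementation-dependent ranges.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of Eli): tests whether character c
 * belongs to the class whose body (the text between [ and ]) is cls.
 * Simplification: a backslash only makes the next character literal.
 */
bool in_class(const char *cls, char c)
{
    bool negate = false, found = false;
    size_t i = 0;

    if (cls[0] == '^') {        /* ^ is special only in first position */
        negate = true;
        i = 1;
    }

    while (cls[i] != '\0') {
        char lo, hi;

        if (cls[i] == '\\' && cls[i + 1] != '\0')
            i++;                /* escaped: take next character literally */
        lo = hi = cls[i++];

        /* A '-' followed by another character forms a range; a '-' in
         * first or last position is simply a member of the set. */
        if (cls[i] == '-' && cls[i + 1] != '\0') {
            i++;
            if (cls[i] == '\\' && cls[i + 1] != '\0')
                i++;
            hi = cls[i++];
        }
        if (lo > hi) {          /* endpoints may appear in either order */
            char t = lo;
            lo = hi;
            hi = t;
        }
        if (lo <= c && c <= hi)
            found = true;
    }
    return negate ? !found : found;
}
```

With this sketch, the class bodies -+0-9 and 0-9\-+ both accept the character -, and 9-0 accepts the same digits as 0-9.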
Single characters, character strings and character classes are all simple
regular expressions.
Each matches a particular set of character sequences.
More complex patterns are built from these simple regular expressions by
concatenation, alternation and repetition.
The components of a complex pattern may be grouped by enclosing them in
parentheses; a parenthesized expression behaves like a simple regular
expression in further compositions.
Components must not be separated by white space, because white space
terminates a regular expression.
When a complex regular expression is written as a sequence of components,
the resulting pattern will match a sequence of characters consisting of
a subsequence matching the first component,
followed by a subsequence matching the second component,
and so on:
[1-9]\.[0-9][0-9]
This complex expression has four components:
three character classes and the text character . (the backslash
converts the operator character . to a text character).
It matches character sequences like 2.54 and 9.99,
but not 0.59, 45.678 or 1x23.
When the components of a complex regular expression are separated by the
operator |,
the resulting pattern will match a sequence of characters
that matches at least one of the components:
[A-Za-z]|[1-9][0-9]&
This complex expression has two immediate components:
a character class and a complex expression that is the result of
concatenating two character classes and a single character.
The complete expression matches character sequences like B and
10&, but not X11 or A&.
Concatenation takes precedence over alternation in constructing a complex
regular expression, so this example is equivalent to
[A-Za-z]|([1-9][0-9]&).
Parentheses can be used to group the expression differently:
([A-Za-z]|[1-9][0-9])&
This complex expression also has two immediate components, but they are
a parenthesized expression and a single character.
The complete expression matches character sequences like B& and
10&.
When a complex regular expression consists of a single component
followed by the operator ?,
the resulting pattern will match either an empty sequence
or a sequence of characters that matches the component:
(-|\+?)[1-9]
Here the operand of ? is the text character +.
This complex expression matches character sequences like -1,
+2 and 3.
In each case, the pattern matches the longest sequence of characters
consonant with its definition.
The ? operator takes precedence over both concatenation and
alternation.
If its operand is a complex expression involving either of these
operations, that complex expression must be parenthesized.
When a complex regular expression consists of a single component
followed by the operator +,
the resulting pattern will match a sequence of characters
that matches one or more successive occurrences of a sequence matching
the component:
[0-9]+
This complex expression has one immediate component:
a character class.
It matches character sequences like 0 and 1019.
In each case, the pattern matches the longest sequence of characters
consonant with its definition.
The + operator takes precedence over both concatenation and
alternation.
If its operand is a complex expression involving either of these
operations, that complex expression must be parenthesized.
When a complex regular expression consists of a single component
followed by the operator *,
the resulting pattern will match a sequence of characters
that matches an empty sequence or one or more successive occurrences
of a sequence matching the component:
[1-9][0-9]*
This complex expression has two immediate components:
a character class and a complex expression whose operator is *.
That complex expression, in turn, has a single character class component.
The complete expression matches character sequences like 1 and
2992, but not 0 or 0101.
In each case, the pattern matches the longest sequence of characters
consonant with its definition.
The * operator takes precedence over both concatenation and
alternation.
If its operand is a complex expression involving either of these
operations, that complex expression must be parenthesized.
For example, ([1-9][0-9])* would match character sequences like
1019 and 2992, but not 1 or 123.
When a complex regular expression consists of a single component
followed by the operator {m,n}
(m and n integers greater than 0),
the resulting pattern will match a sequence of characters
that matches no fewer than m and no more than n
successive occurrences of a sequence matching the component:
[A-Za-z][A-Za-z0-9]{1,5}
This complex expression has two immediate components:
a character class and a complex expression whose operator is
{1,5}.
That complex expression, in turn, has a single character class component.
The complete expression matches character sequences like A1 and
xyzzy, but not identifier or 01July.
In each case, the pattern matches the longest sequence of characters
consonant with its definition.
The {m,n} operator takes precedence over
both concatenation and alternation.
If its operand is a complex expression involving either of these
operations, that complex expression must be parenthesized.
For example, ([1-9][0-9]){1,2} would match character sequences
like 10 and 2992,
but not 1, 123 or 123456.
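Bounded repetition can be pictured as a counting loop with a lower and an upper limit. The following C sketch applies that idea to the pattern [A-Za-z][A-Za-z0-9]{1,5}; the function names are hypothetical, the strings are assumed to be NUL-terminated, and the "C" locale is assumed.

```c
#include <ctype.h>

/* Hypothetical helper: counts how many leading characters of s satisfy
 * pred, stopping after at most n of them.
 * Returns the count, or -1 if fewer than m were found. */
static int repeat_class(const char *s, int (*pred)(int), int m, int n)
{
    int k = 0;

    while (k < n && s[k] != '\0' && pred((unsigned char)s[k]))
        k++;
    return k >= m ? k : -1;
}

/* Length of the longest prefix of s matched by
 * [A-Za-z][A-Za-z0-9]{1,5}, or -1 if there is no match. */
int match_short_name(const char *s)
{
    int tail;

    if (!isalpha((unsigned char)s[0]))
        return -1;
    tail = repeat_class(s + 1, isalnum, 1, 5);
    return tail < 0 ? -1 : 1 + tail;
}
```

For identifier the sketch returns 6, which is less than the length of the whole word: the pattern accepts only the initial segment identi, not the complete sequence.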
When more than one expression can match the current character sequence,
a choice is made as follows:
-
The longest match is preferred.
-
Among rules which match the same number of characters, the rule given
first is preferred.
Thus, suppose we have the following descriptions:
Limit: $55
Speed: $[0-9]+
If the input text is 550kts then the sequence 550
is classified as Speed, because [0-9]+ matches three characters
while 55 matches only two.
If the input is 55mph then both patterns match two characters,
and the sequence 55 is classified as Limit because Limit
was given first.
Any shorter sequence of digits (e.g. the 5 in 5kph) would not match the
regular expression 55
and so the Speed classification would be used.
When more than one type-`gla' file is provided, specifications in
different files have no defined order.
Thus if Limit and Speed appeared in different files,
classification of the sequence 55 would be undefined.
If an ambiguity between two descriptions is to be resolved on the basis of
their order of appearance, they must be given within the same
type-`gla' file.
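The two disambiguation rules can be sketched in C for the Limit and Speed descriptions. This only illustrates the selection rule and says nothing about how the generated scanner is actually implemented; all names and classification codes below are hypothetical.

```c
#include <ctype.h>
#include <string.h>

enum { NO_MATCH = -1, LIMIT = 0, SPEED = 1 };   /* hypothetical codes */

/* Length matched by the pattern 55 at the start of s (0 if none) */
static size_t match_limit(const char *s)
{
    return strncmp(s, "55", 2) == 0 ? 2 : 0;
}

/* Length matched by [0-9]+ at the start of s (0 if none) */
static size_t match_speed(const char *s)
{
    size_t n = 0;

    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Tries both rules in specification order and returns the winning
 * classification, storing the matched length in *len. */
int classify(const char *s, size_t *len)
{
    size_t (*rules[])(const char *) = { match_limit, match_speed };
    int best = NO_MATCH;
    size_t i;

    *len = 0;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i](s);

        /* A strictly longer match wins; on a tie the earlier rule
         * is kept because it was recorded first. */
        if (n > *len) {
            *len = n;
            best = (int)i;
        }
    }
    return best;
}
```

Because the comparison is strict, Limit wins for 55mph (a tie at two characters), while Speed wins for 550kts and for 5kph.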
An auxiliary scanner is a routine to be invoked after the pattern described
by the regular expression has been matched.
The routine is passed a pointer to the matched string and the length of that
string, and it returns a pointer to the first character that is not to be
considered part of the string matched.
Thus an auxiliary scanner may increase, reduce or leave unchanged
the number of characters matched by the regular expression.
This allows a user to specify operationally patterns that are
tedious or impossible to describe using regular expressions
(e.g. nested comments), or that require special operations during the match
(e.g. sequences containing tabs or newlines -- see Spaces, Tabs and Newlines),
or that would benefit from specialized error reporting.
An auxiliary scanner is invoked by giving its name, surrounded by
parentheses (( )), on the same line as the associated
regular expression:
$-- (auxEOL)
This specification invokes the auxiliary scanner auxEOL
whenever a sequence of two dashes is recognized, and passes it a pointer to
the first of the two dashes and a length of 2.
As described below, auxEOL returns a pointer to the first character
of the next line, after having updated the coordinate information.
This specification is the implementation of the canned description
ADA_COMMENT.
The remainder of this section describes the auxiliary scanners that
are available in the Eli library, and also
explains how to implement auxiliary scanners for
tasks that are specific to your problem.
All of the auxiliary scanners described in this section can be used simply
by mentioning their names in a specification line.
They can also be invoked from arbitrary C programs if the invoker includes
the header file `ScanProc.h'.
(The source code for that file is `$elipkg/Scan/ScanProc.h'.)
The name of the file containing each available auxiliary scanner is also
given in this section.
It is not necessary to examine this file in order to use the auxiliary
scanner, but sometimes an existing auxiliary scanner can be useful as a
starting point for solving a similar problem
(see Building scanners).
auxNUL
- This routine is invoked automatically when the first character of a
sequence is the ASCII NUL character, a pattern that cannot be specified by
a regular expression.
In that case, the character sequence matched by the associated pattern is
an empty sequence.
If information remains in the current input file, auxNUL returns a
pointer to the empty sequence at the beginning of that information.
Effectively, this is a pointer to the new information.
This routine is also invoked by any scanner that must accept a newline
character and continue.
Since an ASCII NUL character signalling the end of the current information
in the buffer can occur immediately after any newline, a scanner that
accepts a newline and continues must check for NUL.
If a NUL is found, the scanner invokes auxNUL.
Here is a typical code sequence that such a scanner might use.
The variable p is the scan pointer and
start points to the beginning of the current token:
if (*p == '\0') {
  int current = p - start;
  TokenStart = start = auxNUL(start, current);
  p = start + current;
  StartLine = p - 1;
  if (*p == '\0') {
    /* Code to deal appropriately with end-of-file.
     * Some of the possibilities are:
     *   1. Output an error report and return p
     *   2. Simply return p
     *   3. Move to another file and continue
     ***/
  }
}
If information remains in the current input file, the library version of
auxNUL (see Text Input of Library Reference Manual)
appends that information to the character sequence
matched by the associated pattern, possibly relocating that sequence.
It returns a pointer to the first character of the sequence matched by the
associated pattern.
Source code: `$elipkg/Scan/auxNUL.c'.
To obtain different behavior when the first character of a sequence is the
ASCII NUL character, supply your own routine with the name auxNUL in
a type-`c' file.
The easiest way to do this is to copy the source code for the library routine
into a local file and then modify it.
auxEOF
- This routine is invoked automatically when the first character of a
sequence is the ASCII NUL character, a pattern that cannot be specified by
a regular expression, and no information remains in the current input file.
In that case, the character sequence matched by the associated pattern is
an empty sequence.
The library version of auxEOF simply returns the argument supplied to it.
Source code: `$elipkg/Scan/auxEOF.c'.
To obtain different behavior when the first character of a sequence is the
ASCII NUL character, and no information remains in the current input file,
supply your own routine with the name auxEOF in a type-`c' file.
The easiest way to do this is to copy the source code for the library routine
into a local file and then modify it.
coordAdjust
- Leaves the character sequence matched by the associated pattern unchanged.
Updates the coordinate information to reflect the tabs and newlines
in that sequence.
Source code: `$elipkg/Scan/coordAdjust.c'
auxNewLine
- Leaves the character sequence matched by the associated pattern unchanged.
Updates the coordinate information under the assumption that the last
character of that sequence is a newline.
(This is a special case that can be handled more efficiently than the
general case, for which
coordAdjust would be used.)
Source code: `$elipkg/Scan/auxNewLine.c'
auxTab
- Leaves the character sequence matched by the associated pattern unchanged.
Updates the coordinate information under the assumption that the last
character of that sequence is a tab.
(This is a special case that can be handled more efficiently than the
general case, for which
coordAdjust would be used.)
Source code: `$elipkg/Scan/auxTab.c'
auxEOL
- Extends the character sequence matched by the associated pattern to the end
of the current line, including the terminating newline.
Updates the coordinate information to reflect the new position.
Source code: `$elipkg/Scan/auxScanEOL.c'
auxNoEOL
- Extends the character sequence matched by the associated pattern to the end
of the current line, but does not include the terminating newline.
Updates the coordinate information to reflect the new position.
Source code: `$elipkg/Scan/auxNoEOL.c'
auxCString
- Completes a C string constant when provided with the opening quote (").
Updates the coordinate information to reflect the tabs and newlines
in that sequence.
Source code: `$elipkg/Scan/CchStr.c'.
auxCChar
- Completes a C character constant when provided with the opening quote (').
Source code: `$elipkg/Scan/CchStr.c'.
auxCComment
- Completes a C comment when provided with the opening delimiter
(/*).
Updates the coordinate information to reflect the tabs and newlines
in the comment.
The comment is terminated by the delimiter */, and may not contain
nested comments.
Source code: `$elipkg/Scan/Ccomment.c'
auxM2String
- Completes a string constant when provided with the opening quote,
possibly followed by other characters.
Updates the coordinate information to reflect the tabs
in that sequence.
The string constant is terminated by an occurrence of the opening quote.
If a newline or the end of the input text is reached before the constant
terminates, auxM2String reports an error.
For Modula2, the opening quote is either the character ' or the
character ".
This auxiliary scanner simply uses the first character of the string matched
by the regular expression as the opening quote character, so it can
complete any sequence of characters that is terminated by
the first character,
and is contained wholly within a single source line.
Note that the characters matched by the regular expression are not
re-scanned for a closing quote.
Source code: `$elipkg/Scan/M2chStr.c'
auxM3Comment
- Completes a Modula2 or Modula3 comment when provided with the opening
delimiter ((*).
Updates the coordinate information to reflect the tabs and newlines
in the comment.
The comment is terminated by the delimiter *), and may contain
nested comments.
Source code: `$elipkg/Scan/M3comment.c'
auxPascalString
- Completes a string constant when provided with the opening quote,
possibly followed by other characters.
Updates the coordinate information to reflect the tabs
in that sequence.
The string constant is terminated by an occurrence of the opening quote
that is not immediately followed by another occurrence of the opening
quote.
(Thus the opening quote character may appear doubled within the string.)
If a newline or the end of the input text is reached before the constant
terminates, auxPascalString reports an error.
For Pascal, the opening quote is the character '.
This auxiliary scanner simply uses the first character of the string matched
by the regular expression as the opening quote character, so it can
complete any sequence of characters that is terminated by
a single occurrence of the first character,
and not by two successive occurrences of that character,
and is contained wholly within a single source line.
Note that the characters matched by the regular expression are not
re-scanned for a closing quote.
Source code: `$elipkg/Scan/pascalStr.c'
auxPascalComment
- Completes a Pascal comment when provided with the opening delimiter
(either { or (*).
Updates the coordinate information to reflect the tabs and newlines
in the comment.
A comment is terminated by either the delimiter } or the delimiter
*), regardless of the opening delimiter.
Comments may not be nested.
Source code: `$elipkg/Scan/pascalCom.c'
Ctext
- Completes a C compound statement when provided with the opening brace
({).
Updates the coordinate information to reflect the tabs and newlines
in the compound statement.
A compound statement is terminated by the matching close brace (}).
Compound statements may be nested, and unmatched braces may be embedded in
C strings, character constants or comments.
Source code: `$elipkg/Scan/Ctext.c'
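To make the termination rule described for auxPascalString concrete, here is a simplified C sketch written against the (start, length) calling convention shared by the auxiliary scanners. It is not the library source: the name sketchPascalString is hypothetical, coordinate updates for tabs are omitted, and instead of reporting an error it simply stops at a newline or at the end of the text.

```c
/* Hypothetical sketch, not the library routine.
 * start points to the matched text, whose first character is taken
 * as the quote character; length is the number of matched characters.
 * Returns a pointer to the first character after the string constant.
 */
char *sketchPascalString(char *start, int length)
{
    char quote = start[0];      /* opening quote chosen by the pattern */
    char *p = start + length;   /* matched characters are not re-scanned */

    while (*p != '\0' && *p != '\n') {
        if (*p == quote) {
            if (p[1] == quote) {
                p += 2;         /* doubled quote: part of the string */
                continue;
            }
            return p + 1;       /* single quote: end of the constant */
        }
        p++;
    }
    /* Newline or end of text reached before the closing quote.
     * The library routine reports an error here; this sketch just
     * stops in front of the offending character. */
    return p;
}
```

Given the text 'it''s' + 1 and a match of length 1, the sketch returns a pointer to the space that follows the closing quote.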
All auxiliary scanners obey the same interface conventions:
extern char *Name(char *start, int length);
/* Auxiliary scanner "Name"
* On entry-
* start points to the first character matching the associated
* regular expression
* length=number of characters matching the associated
* regular expression
* On exit-
* Name points to the first character that does not belong to the
* character sequence being classified
***/
Unless otherwise stated, Name>=start on return,
and all characters in the half-open interval [start,Name)
are in memory.
Any auxiliary scanner that passes over tabs or newline characters must
update coordinate information
(see Maintaining the Source Text Coordinates).
In addition, if the character following a newline is an ASCII NUL
then the source buffer must be refilled
(see Text Input of Library Reference Manual).
The easiest way to develop an auxiliary scanner is therefore to start with
one from the library that solves a similar problem.
Source file names for all of the available auxiliary scanners are given
in the previous subsection.
To obtain a copy of (say) the source code for auxNUL as file
`MyScanner.c' in your current directory, give the Eli request:
-> $elipkg/Scan/auxNUL.c > MyScanner.c
After modifying `MyScanner.c', simply add its name to your
type-`specs' file to make it available.
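As an illustration of the interface conventions, here is a minimal sketch of a user-written auxiliary scanner that completes a nestable comment opened by (* and closed by *). The name myNestedComment is hypothetical. The sketch assumes that the whole comment is already in memory, and it deliberately omits the coordinate updates and buffer refills that a real auxiliary scanner must perform when it passes over tabs and newlines.

```c
/* Hypothetical auxiliary scanner (a sketch, not library code).
 * On entry-
 *   start points to the opening delimiter matched by the pattern
 *   length=number of characters matched (2 for the opener)
 * On exit-
 *   the result points just past the matching closing delimiter,
 *   or at the terminating NUL if the comment is not closed
 * Omitted: coordinate maintenance and refilling the source buffer.
 */
char *myNestedComment(char *start, int length)
{
    char *p = start + length;   /* first character after the opener */
    int depth = 1;              /* number of comments currently open */

    while (depth > 0 && *p != '\0') {
        if (p[0] == '(' && p[1] == '*') {
            depth++;            /* nested opener */
            p += 2;
        } else if (p[0] == '*' && p[1] == ')') {
            depth--;            /* closer for the innermost comment */
            p += 2;
        } else {
            p++;
        }
    }
    return p;
}
```

Such a routine would be attached to a pattern that matches the opening delimiter, in the same way that auxEOL is attached to the pattern for two dashes.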
A token processor is a routine to be invoked after the pattern described
by the regular expression has been matched, and after any associated
auxiliary scanner has been invoked.
It is passed a pointer to the matching character sequence,
the length of that sequence,
a pointer to an integer variable containing the classification, and
a pointer to an integer variable to hold a value
representing the character sequence.
The token processor may change the classification,
and may compute a value to represent the sequence.
A token processor is invoked by giving its name, surrounded by
brackets ([ ]), on the same line as the associated
regular expression:
Integer: $[0-9]+ [mkint]
This specification invokes the token processor mkint
whenever a sequence of digits is recognized.
The arguments are a pointer to the first digit,
the length of the digit sequence,
a pointer to an integer variable containing the classification code for
Integer,
and a pointer to an integer variable to hold a value
representing the digit sequence.
As described below, mkint leaves the character sequence and
its classification unchanged and
sets the value to the decimal integer denoted by the digit sequence.
This specification is the implementation of the canned description
PASCAL_INTEGER.
This section describes the token processors that are available in the
Eli library, and also explains how to implement token processors for
tasks that are specific to your problem.
All of the token processors described in this section can be used simply
by mentioning their names in a specification line.
They can also be invoked from arbitrary C programs if the invoker includes
the header file `ScanProc.h'.
(The source code for that file is `$elipkg/Scan/ScanProc.h'.)
The name of the file containing each available token processor is also
given in this section.
It is not necessary to examine that file in order to use the token
processor, but sometimes an existing token processor can be useful
as a starting point for solving a similar problem
(see Building processors).
c_mkchar
- Assumes that the character sequence has the form of a C character constant.
Sets the value to the integer encoding of that character constant.
Does not alter the initial classification.
Source file: `$elipkg/Scan/CchStr.c'.
c_mkint
- Assumes that the character sequence has the form of a C integer constant.
Sets the value to the integer represented by that constant.
Does not alter the initial classification.
Source file: `$elipkg/Scan/int.c'.
c_mkstr
- Assumes that the character sequence has the form of a C string constant.
Stores a new copy of that constant in the character storage module
and sets the value to the index of that copy
(see Character String Storage of Library Reference Manual).
If the string constant contains an escape sequence representing ASCII
NUL, it is truncated and an error report is issued.
The last character of the stored constant is the character preceding the
first NUL.
Does not alter the initial classification.
Source file: `$elipkg/Scan/CchStr.c'.
EndOfText
- This processor is invoked automatically
when the end of the input text is reached.
It assumes that the character sequence is empty, and does nothing.
Source file: `$elipkg/Scan/dflteot.c'.
To obtain different behavior when the end of the input text is reached,
supply your own routine with the name EndOfText in
a type-`c' file.
The easiest way to do this is to copy the source code for the library routine
into a local file and then modify it.
lexerr
- Reports that the character sequence is not a token.
Does not alter the initial classification, and does not compute a value.
There is no source file for this token processor; it is a component of the
scanner itself, but its interface is exported so that it can be used by
other modules.
mkidn
- Looks the character sequence up in the identifier table
(see Unique Identifier Management of Library Reference Manual).
If it is not in the table, it is added with its classification unchanged.
Otherwise mkidn changes the initial classification to the
classification given by the identifier table.
(The identifier table can be initialized with pre-classified character
strings, see Literal Symbols.)
In any case, mkidn sets the value to the (unique) index
of the character sequence in the character storage module
(see Character String Storage of Library Reference Manual).
Source file: `$elipkg/Scan/idn.c'.
mkint
- Assumes that the character sequence consists of one or more decimal digits.
Sets the value to the integer denoted by that sequence of digits.
Does not alter the initial classification.
Source file: `$elipkg/Scan/int.c'.
mkstr
- Stores a new copy of the character sequence in the character storage module
and sets the value to the index of that copy
(see Character String Storage of Library Reference Manual).
Does not alter the initial classification.
Source file: `$elipkg/Scan/str.c'.
modula_mkint
- Assumes that the character sequence consists of one or more hexadecimal
digits, possibly followed by a radix marker.
Sets the value to the integer denoted by that sequence of digits,
interpreted in the given radix.
Does not alter the initial classification.
Valid radix markers are B and C (indicating radix 8), and
H (indicating radix 16).
Sequences of digits not followed by a radix marker are assumed to be radix
10.
Source file: `$elipkg/Scan/M2int.c'.
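The radix rule described for modula_mkint can be sketched as a plain C function. This is not the library source and does not use the token processor interface; the name sketchModulaInt is hypothetical, only upper-case hexadecimal digits are accepted, and overflow is not checked.

```c
#include <ctype.h>

/* Hypothetical sketch, not the library routine.
 * Returns the value denoted by the first length characters of start,
 * or -1 if some character is not a valid digit in the selected radix.
 */
long sketchModulaInt(const char *start, int length)
{
    int radix = 10, ndigits = length, i;
    long value = 0;

    /* A trailing B or C selects radix 8, a trailing H selects radix 16 */
    if (length > 0) {
        char last = start[length - 1];

        if (last == 'B' || last == 'C') {
            radix = 8;
            ndigits--;
        } else if (last == 'H') {
            radix = 16;
            ndigits--;
        }
    }

    for (i = 0; i < ndigits; i++) {
        int c = (unsigned char)start[i];
        int d = isdigit(c) ? c - '0'
              : (c >= 'A' && c <= 'F') ? c - 'A' + 10
              : radix;          /* not a digit at all */

        if (d >= radix)
            return -1;
        value = value * radix + d;
    }
    return value;
}
```

Note that B and C are radix markers only in the last position; before a trailing H they are ordinary hexadecimal digits.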
All token processors obey the same interface conventions:
extern void Name(const char *start, int length, int *syncode, int *intrinsic);
/* Token processor "Name"
* On entry-
* start points to the first character of the sequence being classified
* length=length of the sequence being classified
* syncode points to a location containing the initial classification
* intrinsic points to a location to receive the value
* On exit-
* syncode points to a location containing the final classification
* intrinsic points to a location containing the value (if relevant)
***/
The token processor can change the classification of the character sequence.
It may carry out any computation whatsoever, involving arbitrary modules,
to obtain the information it needs.
Eli generates a file called `termcode.h'
that contains #define directives specifying the classification code
for each symbol appearing before a colon at the beginning of a line in a
type-`gla' file.
Thus if name: ... is a line in a type-`gla' file,
a processor can use the following sequence to change the
classification of any character sequence, including one that is initially
classified as a comment, to name:
#include "termcode.h"
...
*syncode = name;
...
All comments are classified by the value of the symbol NORETURN,
exported by the lexical analyzer module in file `gla.h'.
A token processor can cause the character sequence matched by its
associated regular expression to be considered a comment by setting the
classification to NORETURN:
#include "gla.h"
...
*syncode = NORETURN;
...
The easiest way to develop a token processor is to start with
one from the library that solves a similar problem.
Source file names for all of the available token processors are given
in the previous subsection.
To obtain a copy of (say) the source code for EndOfText as file
`MyProcessor.c' in your current directory, give the Eli request:
-> $elipkg/Scan/dflteot.c > MyProcessor.c
After modifying `MyProcessor.c', simply add its name to your
type-`specs' file to make it available.
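As an illustration of the interface conventions, here is a minimal sketch of a user-written token processor that changes the classification of one particular character sequence. Everything specific in it is an assumption made for the example: the processor name mkBeginOrIdent, the symbols Identifier and BeginKeyword, and their values, which stand in for definitions that `termcode.h' would supply in a real specification.

```c
#include <string.h>

/* Stand-ins for codes that termcode.h would define; the names and
 * values are assumptions made for this sketch only. */
#define Identifier   1
#define BeginKeyword 2

/* Hypothetical token processor (a sketch, not library code).
 * On entry-
 *   start points to the first character of the sequence
 *   length=length of the sequence
 *   syncode points to the initial classification
 * On exit-
 *   syncode is BeginKeyword if the sequence is exactly "begin",
 *     otherwise it is unchanged
 *   intrinsic receives the length of the sequence as its value
 */
void mkBeginOrIdent(const char *start, int length, int *syncode, int *intrinsic)
{
    /* The sequence is not NUL-terminated, so compare by length */
    if (length == 5 && strncmp(start, "begin", 5) == 0)
        *syncode = BeginKeyword;
    *intrinsic = length;
}
```

The length test matters: without it the sequence beginning, whose first five characters are begin, would be reclassified as well.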