Lexical Analysis
The default behavior of an Eli-generated lexical analyzer is to treat each
ASCII character as an entity distinct from all other ASCII characters.
This behavior is inappropriate for applications that do not distinguish
upper-case letters from lower-case letters in certain contexts.
For example, a Pascal compiler ignores the case of letters in identifiers
and keywords, but respects it in strings.
Thus the Pascal identifiers MyId, MYID and myid are
identical but the strings 'MyString', 'MYSTRING' and
'mystring' are different.
Case insensitivity is reflected in the identity of character sequences.
In other words, the character sequences MyId, MYID and
myid are considered to be identical character sequences if and only
if the generated processor is insensitive to the case of letters.
Two character sequences are identical as far as the remainder of the
processor is concerned if they have the same classification and their
values are equal (see Specifications).
Since the classification and value are determined by the token processor,
it is the token processor that must implement case insensitivity.
Two conditions must be met if a processor is to be insensitive to case:
- A token processor that maintains a table of character sequences
  in which all letters are of one case must be available.
- The specification of each case-insensitive character sequence
  must invoke such a token processor.
The token processor mkidn
maintains a table of character sequences
and provides the same classification and value for
identical character sequences.
Normally, mkidn treats upper-case letters and lower-case letters as
different characters.
This behavior is controlled by an exported variable, dofold
(see Unique Identifier Management of Library Reference Manual).
When dofold=0, character sequences are entered into the table exactly as
they are passed to mkidn; otherwise all letters in the sequence are
converted to upper case before the sequence is entered into the table.
Although the value of dofold could be altered on the basis of
context by user-defined code, it is normally constant throughout the
processor's execution.
To generate a processor in which dofold=1, specify the parameter
+fold in the request
(see fold -- Make the Processor Case-Insensitive of Products and Parameters Reference Manual).
If this parameter is not specified in the request, Eli will produce a
processor with dofold=0.
The value set by mkidn is the (unique) index of
the transformed character sequence in the table.
Thus, when dofold is nonzero, using that value to retrieve the sequence
at a later time yields the original sequence with all lower-case letters
replaced by their upper-case equivalents.
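The effect of folding on the table index can be illustrated with a minimal C sketch.
This is not Eli's implementation of mkidn: the names fold_flag, intern, stored and reset_table are hypothetical, and the fixed-size linear table is an assumption chosen only to keep the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_ENTRIES 64
#define MAX_LEN 32

/* Hypothetical stand-in for the folding switch described in the text. */
static int fold_flag = 0;

static char seq_table[MAX_ENTRIES][MAX_LEN];
static int seq_count = 0;

/* Empty the table and choose whether later insertions fold case. */
static void reset_table(int fold)
{
    fold_flag = fold;
    seq_count = 0;
}

/* Return the unique index of the (possibly folded) sequence, entering it
   into the table if it is new. Returns -1 if the sequence is too long or
   the table is full. */
static int intern(const char *s)
{
    char key[MAX_LEN];
    size_t n = strlen(s);
    size_t i;
    int k;

    if (n >= MAX_LEN) return -1;
    for (i = 0; i < n; i++)
        key[i] = fold_flag ? (char)toupper((unsigned char)s[i]) : s[i];
    key[n] = '\0';

    for (k = 0; k < seq_count; k++)
        if (strcmp(seq_table[k], key) == 0) return k;
    if (seq_count == MAX_ENTRIES) return -1;
    strcpy(seq_table[seq_count], key);
    return seq_count++;
}

/* Retrieve the stored form of a sequence from a valid index. */
static const char *stored(int index)
{
    return seq_table[index];
}

/* Usage: with folding on, all spellings share one index and the stored
   form is upper case; with folding off, spellings stay distinct. */
static void demo_fold(void)
{
    int a;

    reset_table(1);
    a = intern("MyId");
    assert(a == intern("MYID"));
    assert(a == intern("myid"));
    assert(strcmp(stored(a), "MYID") == 0);

    reset_table(0);
    assert(intern("MyId") != intern("myid"));
}
```

Calling demo_fold() exercises both settings. The point of the sketch is that the original spelling is not recoverable from the index once folding has been applied.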
Since literal symbols are recognized exactly as they stand in the grammar,
they are case sensitive by definition.
For example, if a grammar for Pascal contains the literal symbol
'begin' then the generated processor will recognize only the
character sequence begin as an instance of that literal symbol.
This behavior could be changed by redefining the literal symbol as a
nonliteral symbol (say) BEGIN, and providing the following
specification in a type-`gla' file:
    BEGIN: $[Bb][Ee][Gg][Ii][Nn] [mkidn]
If the number of literal symbols to be treated as case-insensitive is
large, this is a very tedious and error-prone approach.
It also distorts the grammar by converting literal terminal symbols
to nonliteral terminal symbols.
To solve this problem, Eli allows the user to specify a set of literal
symbols that should be placed into the table used by mkidn, with
their classification codes, at the time the generated lexical analyzer is
loaded.
If the +fold parameter is also specified, all lower-case letters in
these symbols will be replaced by their upper-case equivalents before the
symbol is placed into the table.
The desired behavior is then obtained by invoking mkidn after
recognizing the appropriate character sequence in the input text.
The set of literal symbols to be placed into the table is specified by
giving a sequence of regular expressions in a type-`gla' file, and
then deriving the :kwd product from that file
(see kwd -- Recognize Specified Literals as Identifiers of Products and Parameters Reference Manual).
The regular expressions describe the form of the literal symbols in the
grammar, not the input character sequences to be recognized.
Suppose, for example, that a Pascal grammar specified all keywords as
literal symbols made up of lower-case letters:
    Statement:
       ...
       'while' Expression 'do' Statement /
       ...
A type-`gla' file describing the form these symbols take in the
grammar would consist of the single line $[a-z]+.
If that file is named `PascalKey.gla', the user can tell
Eli to initialize mkidn's table with all of the keywords by
including the following line in a type-`specs' file:
    PascalKey.gla :kwd
In Pascal, keywords have the form of identifiers in the input text.
Therefore the canned description PASCAL_IDENTIFIER suffices to
recognize both identifiers and keywords.
PASCAL_IDENTIFIER invokes mkidn to obtain the classification
and value of the sequence recognized by the regular
expression $[a-zA-Z][a-zA-Z0-9]*.
Since mkidn's table has been initialized with the character
sequences for the literal keyword symbols, and their classifications,
each keyword in the input receives the classification of its literal
symbol rather than that of an identifier.
The :kwd product and the +fold parameter are independent of
one another.
Neither implies the other, so in order to make the generated lexical
analyzer accept Pascal keywords with arbitrary case, the user must both
provide the :kwd specification and derive with the +fold parameter.
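The following C sketch illustrates why both settings are needed.
It is a simplified model, not Eli's generated code: the classification codes IDENT, KW_WHILE and KW_DO, the functions kw_init and classify, and the fixed-size table are all hypothetical.
The table is preloaded with the keywords as they appear in the grammar, and each input sequence is then looked up, with or without folding.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define KW_MAX 64
#define KW_LEN 32

/* Hypothetical classification codes. */
enum { IDENT = 1, KW_WHILE = 2, KW_DO = 3 };

struct kw_entry { char text[KW_LEN]; int code; };

static struct kw_entry kw_table[KW_MAX];
static int kw_count = 0;
static int kw_fold = 0;   /* models the folding switch */

/* Copy s into out, converting to upper case when folding is enabled.
   Sequences longer than cap-1 characters are truncated. */
static void normalize(const char *s, char *out, size_t cap)
{
    size_t i;
    for (i = 0; s[i] != '\0' && i + 1 < cap; i++)
        out[i] = kw_fold ? (char)toupper((unsigned char)s[i]) : s[i];
    out[i] = '\0';
}

/* Return the code already stored for s, or enter s with default_code. */
static int lookup(const char *s, int default_code)
{
    char key[KW_LEN];
    int k;

    normalize(s, key, sizeof key);
    for (k = 0; k < kw_count; k++)
        if (strcmp(kw_table[k].text, key) == 0) return kw_table[k].code;
    if (kw_count < KW_MAX) {
        strcpy(kw_table[kw_count].text, key);
        kw_table[kw_count].code = default_code;
        kw_count++;
    }
    return default_code;
}

/* Preload the grammar's literal keywords; they are folded too when
   folding is enabled, mirroring the behavior described in the text. */
static void kw_init(int fold)
{
    kw_fold = fold;
    kw_count = 0;
    lookup("while", KW_WHILE);
    lookup("do", KW_DO);
}

/* Classify an input sequence that has the form of an identifier. */
static int classify(const char *s)
{
    return lookup(s, IDENT);
}

/* Usage: preloading alone recognizes only the exact spelling;
   preloading plus folding recognizes any mix of cases. */
static void demo_keywords(void)
{
    kw_init(0);
    assert(classify("while") == KW_WHILE);
    assert(classify("WHILE") == IDENT);

    kw_init(1);
    assert(classify("WHILE") == KW_WHILE);
    assert(classify("While") == KW_WHILE);
    assert(classify("count") == IDENT);
}
```

With preloading but no folding, WHILE is entered as a new identifier; with both, every spelling maps to the entry created for the keyword.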