For many applications, the exact structure of the symbols that must be
recognized is not important or the problem description specifies
that the symbols should be the same as the symbols used in some other
situation (e.g. identifiers might be specified to use the same format as
the identifiers of an existing programming language).
To cover this common situation, Eli provides a set of canned symbol
descriptions.
To use a canned description, simply write the canned description's
identifier in a specification instead of writing a regular expression.
For example, the following type-`gla' file tells Eli that the input
text will contain C-style identifiers and strings, Ada-style comments,
and Pascal-style integers:
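A specification of that kind might look like the following sketch.
The labels Identifier, String and Integer are illustrative, the name
ADA_COMMENT is assumed from the naming pattern of the other canned
descriptions, and the unlabeled line is assumed to be the way a comment is
specified:

```
Identifier:  C_IDENTIFIER
String:      C_STRING_LIT
             ADA_COMMENT
Integer:     PASCAL_INTEGER
```

Each labeled line pairs a symbol name with the canned description that
defines its lexical structure.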
Symbols labeled in this way, such as Integer, would appear as
non-literal terminal symbols in the context-free grammar defining the
phrase structure of this input text
(see How to describe a context-free grammar of Syntax Analysis).
The available canned descriptions are defined later in this section.
All of these definitions include a regular expression, and some include
auxiliary scanners and/or token processors.
An auxiliary scanner or token processor specified by a canned description
can be overridden by nominating a different one in the specification that
names the canned description.
For example, the canned description
PASCAL_STRING includes the token processor
mkstr (see Available scanners).
This token processor stores multiple copies of the same string in the
character storage module.
The following specification overrides
mkstr with the token processor mkidn, which
stores only one copy of each distinct string:
Str: PASCAL_STRING [mkidn]
The auxiliary scanner
auxPascalString, included in the canned
description, is not overridden by this specification.
The remainder of this section characterizes the canned descriptions that are
available in the Eli library, and also gives their definitions.
Each of the identifiers in the following list is the name of a canned
description specifying the lexical structure of some component of an
existing programming language.
Here they are simply characterized by the role they play in that language.
A complete definition of each, consisting of a regular expression, possibly
an auxiliary scanner name, and possibly a token processor name, is given in
the next section.
When building a new language, it is a good idea to use canned descriptions
for lexical components:
Time is not wasted in deciding on their form, mistakes are not made in
their implementation, and users are familiar with them.
The list also provides canned descriptions for spaces, tabs and newlines.
These white space characters are treated as comments by default.
If, however, you define any pattern that will accept a white space
character in its first position, this pattern overrides the
default treatment and that white space character will be accepted only
in contexts that are specified explicitly
(see Spaces, Tabs and Newlines).
For example, suppose that the following pattern were defined
and that no other patterns contain spaces:
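One pattern with this property, given only as a sketch, is a colon
surrounded by spaces.
It assumes that a regular expression is introduced by $ (as in the
definitions at the end of this section) and that a space is written as the
octal escape \040:

```
Separator:  $\040+:\040+
```

Because the expression begins with \040, a space can appear in the first
position of a Separator.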
In that situation, a space will be accepted only if it is part of a
Separator.
To treat spaces that are not part of a
Separator as comments,
include the canned description
SPACES as a comment specification:
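As a sketch, take a hypothetical Separator pattern consisting of a colon
surrounded by spaces, with \040 assumed to be the escape for a space.
The comment specification is shown here as an unlabeled line naming SPACES,
which is an assumption about the form a comment specification takes:

```
Separator:  $\040+:\040+
            SPACES
```

With both lines present, spaces around a colon belong to a Separator and
all other runs of spaces are discarded as comments.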
Note that only a white space character that appears at the beginning of a
pattern loses its default interpretation in this way.
In this example, neither the tab nor the newline appeared at the beginning
of a pattern and therefore tabs and newlines continue to be treated as
comments.
C_IDENTIFIER, C_INTEGER, C_INT_DENOTATION, C_FLOAT, C_STRING_LIT, C_CHAR_CONSTANT, C_COMMENT
- Identifiers, integer constants, floating point constants, string
literals, character literals, and comments from the C programming
language.
C_INTEGER does not permit the L or U flags, but does correctly
accept all other C integer denotations.
By default, it uses
c_mkint to convert the denotation to an
integer value.
c_mkint obeys the C rules for determining the radix of the conversion.
C_INT_DENOTATION accepts all valid ANSI C integer denotations.
By default, it uses
mkstr to deliver a unique string table index for
every occurrence of a denotation.
This behavior is often overridden by adding the token processor
mkidn:
Integer: C_INT_DENOTATION [mkidn]
In this case, two identical denotations will have the same string table
index.
- Character sequences obeying the definition of a C identifier, but
accepting all ISO/IEC 8859-1 letters.
Care must be taken in using this description because these identifiers are
not acceptable to most C compilers.
That means they cannot usually be used as (parts of) identifiers in
generated C code.
PASCAL_IDENTIFIER, PASCAL_INTEGER, PASCAL_REAL, PASCAL_STRING, PASCAL_COMMENT
- Identifiers, integer constants, real constants, string literals, and
comments from the Pascal programming language, respectively.
MODULA2_INTEGER, MODULA2_CHARINT, MODULA2_LITERALDQ, MODULA2_LITERALSQ, MODULA2_COMMENT
- Integer constants, characters specified using character codes, string
literals delimited by double and single quotes, and comments from the
Modula-2 programming language, respectively.
- Comments from the Modula-3 programming language.
- Identifiers and comments from the Ada programming language.
- Comments from the AWK programming language.
SPACES
- Sequence of one or more spaces.
- A single horizontal tab.
- A single newline.
Eli textually replaces a reference to a canned description with its
definition.
If a user nominates an auxiliary scanner and/or a token processor
for a canned description, that overrides the corresponding nomination
appearing in the definition of the canned description.
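As an illustration of this replacement, the two lines below should be
equivalent, assuming that PASCAL_STRING is defined as
$' (auxPascalString) [mkstr], as listed at the end of this section.
The label Str is only an example:

```
Str:  PASCAL_STRING [mkidn]
Str:  $' (auxPascalString) [mkidn]
```

The first line is shorthand for the second, and only one of them would
appear in an actual specification.
The nominated mkidn takes the place of mkstr, while the auxiliary scanner
from the definition is kept.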
The following is an alphabetized list of the canned descriptions
available in the Eli library, with their definitions.
Use this list as a formal definition, and as an example for constructing
your own specifications.
(Several descriptions, among them PASCAL_REAL, have definitions that are too long
to fit on one line of this document.
Each is, however, a single line in the specification file.)
C_CHAR_CONSTANT
$' (auxCChar) [c_mkchar]
C_STRING_LIT
$\" (auxCString) [mkstr]
MODULA2_LITERALDQ
$\" (auxM2String) [mkstr]
MODULA2_LITERALSQ
$\' (auxM2String) [mkstr]
PASCAL_STRING
$' (auxPascalString) [mkstr]