Lexical Analysis
For many applications, the exact structure of the symbols that must be
recognized is not important, or the problem description specifies
that the symbols should be the same as the symbols used in some other
situation (e.g. identifiers might be specified to use the same format as
C identifiers).
To cover this common situation, Eli provides a set of canned symbol
descriptions.
To use a canned description, simply write the canned description's
identifier in a specification instead of writing a regular expression.
For example, the following type-`gla' file tells Eli that the input
text will contain C-style identifiers and strings, Ada-style comments,
and Pascal-style integers:
Identifier: C_IDENTIFIER
ADA_COMMENT
String: C_STRING_LIT
Integer: PASCAL_INTEGER
Identifier, String and Integer would appear as
non-literal terminal symbols in the context-free grammar defining the
phrase structure of this input text
(see How to describe a context-free grammar of Syntax Analysis).
The available canned descriptions are defined later in this section.
All of these definitions include a regular expression, and some include
auxiliary scanners and/or token processors.
An auxiliary scanner or token processor specified by a canned description
can be overridden by nominating a different one in the specification that
names the canned description.
For example, the canned description PASCAL_STRING includes the token
processor mkstr (see Available scanners).
This token processor stores multiple copies of the same string in the
character storage module.
The following specification overrides mkstr with mkidn, which
stores only one copy of each distinct string:
Str: PASCAL_STRING [mkidn]
The auxiliary scanner auxPascalString, included in the canned
description, is not overridden by this specification.
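The difference between the two storage policies can be illustrated with a small Python model. This is only a sketch of the behavior described above, not Eli's implementation: the class StringTable and its methods store_every_occurrence and store_once are hypothetical names chosen for this illustration.

```python
# Toy model of the two storage policies described in the text.
# All names are hypothetical; this is not how Eli implements mkstr or mkidn.

class StringTable:
    def __init__(self):
        self.entries = []   # stored strings, addressed by index
        self.index_of = {}  # string -> index, used only by store_once

    def store_every_occurrence(self, text):
        """mkstr-like policy: every call stores a fresh copy."""
        self.entries.append(text)
        return len(self.entries) - 1

    def store_once(self, text):
        """mkidn-like policy: identical strings share a single entry."""
        if text not in self.index_of:
            self.entries.append(text)
            self.index_of[text] = len(self.entries) - 1
        return self.index_of[text]


copies = StringTable()
a = copies.store_every_occurrence("'hello'")
b = copies.store_every_occurrence("'hello'")  # second copy, new index

unique = StringTable()
c = unique.store_once("'hello'")
d = unique.store_once("'hello'")  # same index as c
e = unique.store_once("'world'")  # distinct string, new index
```

In this model the two occurrences of the same literal receive different indices under the first policy and the same index under the second, which is why the second policy allows strings to be compared by index.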
The remainder of this section characterizes the canned descriptions that are
available in the Eli library, and also gives their definitions.
Each of the identifiers in the following list is the name of a canned
description specifying the lexical structure of some component of an
existing programming language.
Here they are simply characterized by the role they play in that language.
A complete definition of each, consisting of a regular expression, possibly
an auxiliary scanner name, and possibly a token processor name, is given in
the next section.
When building a new language, it is a good idea to use canned descriptions
for lexical components:
time is not wasted in deciding on their form, mistakes are not made in
their implementation, and users are familiar with them.
The list also provides canned descriptions for spaces, tabs and newlines.
These white space characters are treated as comments by default.
If, however, you define any pattern that will accept a white space
character in its first position, this pattern overrides the
default treatment and that white space character will be accepted only
in contexts that are specified explicitly
(see Spaces, Tabs and Newlines).
For example, suppose that the following pattern were defined
and that no other patterns contain spaces:
Separator: $\040+#\040+
In that situation, a space will be accepted only if it is part of a
Separator.
To treat spaces that are not part of a Separator as comments,
include the canned description SPACES as a comment specification:
Separator: $\040+#\040+
SPACES
Note that only a white space character that appears at the beginning of a
pattern loses its default interpretation in this way.
In this example, neither the tab nor the newline appeared at the beginning
of a pattern and therefore tabs and newlines continue to be treated as
comments.
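The white space rule above can be tried out with a small Python model. It is a sketch under several assumptions, not Eli's scanner: it assumes longest-match scanning, the function scan and the Word pattern are hypothetical, and each pattern declares by hand which white space characters it can start with. How Eli actually reports an unacceptable white space character is not modeled; the sketch simply raises an error.

```python
import re

# White space characters that are skipped as comments by default.
DEFAULT_COMMENTS = {" ", "\t", "\n"}

def scan(text, patterns):
    """Longest-match scan of text (hypothetical model).

    patterns is a list of (name, regex, first_chars) triples. A name of None
    marks a comment. first_chars lists the white space characters the pattern
    can start with; it is declared by hand to keep the sketch short.
    """
    claimed = set()
    for _, _, first_chars in patterns:
        claimed |= set(first_chars)
    # Only unclaimed white space keeps its default comment treatment.
    skippable = DEFAULT_COMMENTS - claimed

    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex, _ in patterns:
            m = re.compile(regex).match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_end > pos:
            if best_name is not None:  # comments produce no token
                tokens.append((best_name, text[pos:best_end]))
            pos = best_end
        elif text[pos] in skippable:
            pos += 1
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens


WORD = ("Word", r"[a-z]+", "")            # hypothetical extra token
SEPARATOR = ("Separator", r" +# +", " ")  # mirrors $\040+#\040+
SPACES = (None, r" +", " ")               # mirrors SPACES used as a comment

with_separator = scan("a # b", [WORD, SEPARATOR])
tab_still_comment = scan("a\tb", [WORD, SEPARATOR])
spaces_restored = scan("a b", [WORD, SEPARATOR, SPACES])
```

In this model, scanning "a b" with only WORD and SEPARATOR raises an error, because the space has lost its default treatment and is not part of a Separator. Adding SPACES makes the same input acceptable again, while the tab is skipped in both cases.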
C_IDENTIFIER, C_INTEGER, C_INT_DENOTATION, C_FLOAT, C_STRING_LIT, C_CHAR_CONSTANT, C_COMMENT
- Identifiers, integer constants, floating point constants, string
literals, character literals, and comments from the C programming
language, respectively.
C_INTEGER does not permit the L or U flags, but does correctly
accept all other C integer denotations.
By default, it uses c_mkint to convert the denotation to an
internal int value.
c_mkint obeys the C rules for determining the radix of the conversion.
C_INT_DENOTATION accepts all valid ANSI C integer denotations.
By default, it uses mkstr to deliver a unique string table index for
every occurrence of a denotation.
This behavior is often overridden by adding [mkidn]:
Integer: C_INT_DENOTATION [mkidn]
In this case, two identical denotations will have the same string table
index.
C_IDENTIFIER_ISO
- Character sequences obeying the definition of a C identifier, but
accepting all ISO/IEC 8859-1 letters.
Care must be taken in using this description because these identifiers are
not acceptable to most C compilers.
That means they cannot usually be used as (parts of) identifiers in
generated code.
PASCAL_IDENTIFIER, PASCAL_INTEGER, PASCAL_REAL, PASCAL_STRING, PASCAL_COMMENT
- Identifiers, integer constants, real constants, string literals, and
comments from the Pascal programming language, respectively.
MODULA2_INTEGER, MODULA2_CHARINT, MODULA2_LITERALDQ, MODULA2_LITERALSQ, MODULA2_COMMENT
- Integer constants, characters specified using character codes, string
literals delimited by double and single quotes, and comments from the
Modula-2 programming language, respectively.
MODULA3_COMMENT
- Comments from the Modula-3 programming language.
ADA_IDENTIFIER, ADA_COMMENT
- Identifiers and comments from the Ada programming language.
AWK_COMMENT
- Comments from the AWK programming language.
SPACES
- Sequence of one or more spaces.
TAB
- A single horizontal tab.
NEW_LINE
- A single newline.
Eli textually replaces a reference to a canned description with its
definition.
If a user nominates an auxiliary scanner and/or a token processor
for a canned description, that overrides the corresponding nomination
appearing in the definition of the canned description.
The following is an alphabetized list of the canned descriptions
available in the Eli library, with their definitions.
Use this list as a formal definition, and as an example for constructing
specifications.
(C_FLOAT and PASCAL_REAL have definitions that are too long
to fit on one line of this document.
Each is, however, a single line in the specification file.)
ADA_COMMENT
$-- (auxEOL)
ADA_IDENTIFIER
$[a-zA-Z](_?[a-zA-Z0-9])* [mkidn]
AWK_COMMENT
$# (auxEOL)
C_COMMENT
$"/*" (auxCComment)
C_CHAR_CONSTANT
$' (auxCChar) [c_mkchar]
C_FLOAT
$((([0-9]+\.[0-9]*|\.[0-9]+)((e|E)(\+|-)?[0-9]+)?)|
([0-9]+(e|E)(\+|-)?[0-9]+))[fFlL]? [mkstr]
C_IDENTIFIER
$[a-zA-Z_][a-zA-Z_0-9]* [mkidn]
C_INTEGER
$([0-9]+|0[xX][0-9a-fA-F]*) [c_mkint]
C_INT_DENOTATION
$([1-9][0-9]*|0[0-7]*|0[xX][0-9a-fA-F]+)([uU][lL]?|[lL][uU]?)? [mkstr]
C_STRING_LIT
$\" (auxCString) [mkstr]
MODULA_INTEGER
$[0-9][0-9A-Fa-f]*[BCH]? [modula_mkint]
MODULA2_COMMENT, MODULA3_COMMENT
$\(\* (auxM3Comment)
MODULA2_CHARINT
$[0-9][0-9A-Fa-f]*C [modula_mkint]
MODULA2_INTEGER
$[0-9][0-9A-Fa-f]*[BH]? [modula_mkint]
MODULA2_LITERALDQ
$\" (auxM2String) [mkstr]
MODULA2_LITERALSQ
$\' (auxM2String) [mkstr]
PASCAL_COMMENT
$"{"|"(*" (auxPascalComment)
PASCAL_IDENTIFIER
$[a-zA-Z][a-zA-Z0-9]* [mkidn]
PASCAL_INTEGER
$[0-9]+ [mkint]
PASCAL_REAL
$(([0-9]+\.[0-9]+)((e|E)(\+|-)?[0-9]+)?)|([0-9]+(e|E)(\+|-)?[0-9]+)
[mkstr]
PASCAL_STRING
$' (auxPascalString) [mkstr]
SPACES
$\040+
TAB
$\t (auxTab)
NEW_LINE
$[\r\n] (auxNewLine)
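Some of the regular expressions above can be checked against sample inputs with a short Python script. This is a sketch that assumes the expressions mean the same thing in Python's re module as in a type-gla file, which holds for the simple operators used here. It exercises only the regular expressions: auxiliary scanners and token processors are not modeled, and the helper accepts is a hypothetical name.

```python
import re

# Regular expressions copied from the definitions above, minus the leading "$"
# that marks a regular expression in a type-gla file.
CANNED = {
    "ADA_IDENTIFIER": r"[a-zA-Z](_?[a-zA-Z0-9])*",
    "C_IDENTIFIER": r"[a-zA-Z_][a-zA-Z_0-9]*",
    "C_INTEGER": r"([0-9]+|0[xX][0-9a-fA-F]*)",
    "C_INT_DENOTATION":
        r"([1-9][0-9]*|0[0-7]*|0[xX][0-9a-fA-F]+)([uU][lL]?|[lL][uU]?)?",
    "PASCAL_INTEGER": r"[0-9]+",
}

def accepts(name, text):
    """Return True if the whole of text matches the named expression."""
    return re.fullmatch(CANNED[name], text) is not None

# Show which canned expressions accept each sample string.
for text in ["42", "0x1F", "42UL", "089", "a_b", "a__b", "_tmp"]:
    matching = [name for name in CANNED if accepts(name, text)]
    print(f"{text!r}: {matching}")
```

For instance, "42UL" is accepted by C_INT_DENOTATION but not by C_INTEGER, which has no U or L suffix, and "a__b" is rejected by ADA_IDENTIFIER because its expression allows at most one underscore between alphanumeric characters.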