Lexical Analysis



Canned Symbol Descriptions

For many applications, the exact structure of the symbols that must be recognized is not important, or the problem description specifies that the symbols should be the same as those used in some other situation (e.g., identifiers might be specified to use the same format as C identifiers). To cover this common situation, Eli provides a set of canned symbol descriptions.

To use a canned description, simply write the canned description's identifier in a specification instead of writing a regular expression. For example, the following type-`gla' file tells Eli that the input text will contain C-style identifiers and strings, Ada-style comments, and Pascal-style integers:

Identifier: C_IDENTIFIER
            ADA_COMMENT
String:     C_STRING_LIT
Integer:    PASCAL_INTEGER

Identifier, String and Integer would appear as non-literal terminal symbols in the context-free grammar defining the phrase structure of this input text (see How to describe a context-free grammar of Syntax Analysis).

The available canned descriptions are defined later in this section. All of these definitions include a regular expression, and some include auxiliary scanners and/or token processors. An auxiliary scanner or token processor specified by a canned description can be overridden by nominating a different one in the specification that names the canned description. For example, the canned description PASCAL_STRING includes the token processor mkstr (see Available scanners). This token processor stores multiple copies of the same string in the character storage module. The following specification overrides mkstr with mkidn, which stores only one copy of each distinct string:

Str: PASCAL_STRING [mkidn]

The auxiliary scanner auxPascalString, included in the canned description, is not overridden by this specification.
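
The difference between the two token processors can be pictured with a small Python model. This is only a sketch of the behavior described above, not Eli's implementation: the class names EveryOccurrenceStore and InternedStore are hypothetical, and a plain list stands in for the character storage module.

```python
class EveryOccurrenceStore:
    """Models mkstr-like behavior: every occurrence gets a new entry."""

    def __init__(self):
        self.strings = []

    def add(self, text):
        self.strings.append(text)
        return len(self.strings) - 1  # index of the newly stored copy


class InternedStore:
    """Models mkidn-like behavior: one entry per distinct string."""

    def __init__(self):
        self.strings = []
        self.index_of = {}

    def add(self, text):
        if text not in self.index_of:
            self.index_of[text] = len(self.strings)
            self.strings.append(text)
        return self.index_of[text]


every = EveryOccurrenceStore()
interned = InternedStore()
for literal in ["'abc'", "'xyz'", "'abc'"]:
    every.add(literal)
    interned.add(literal)

print(len(every.strings))     # 3 copies stored
print(len(interned.strings))  # 2 distinct strings stored
```

In this model the repeated literal occupies two slots in the first store and one in the second, which is the storage trade-off that motivates overriding mkstr with mkidn.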

The remainder of this section characterizes the canned descriptions that are available in the Eli library, and also gives their definitions.

Available Descriptions

Each of the identifiers in the following list is the name of a canned description specifying the lexical structure of some component of an existing programming language. Here they are simply characterized by the role they play in that language. A complete definition of each, consisting of a regular expression, possibly an auxiliary scanner name, and possibly a token processor name, is given in the next section.

When building a new language, it is a good idea to use canned descriptions for lexical components: no time is wasted deciding on their form, no mistakes are made in their implementation, and users are already familiar with them.

The list also provides canned descriptions for spaces, tabs and newlines. These white space characters are treated as comments by default. If, however, you define any pattern that will accept a white space character in its first position, this pattern overrides the default treatment and that white space character will be accepted only in contexts that are specified explicitly (see Spaces, Tabs and Newlines). For example, suppose that the following pattern were defined and that no other patterns contain spaces:

Separator:  $\040+#\040+

In that situation, a space will be accepted only if it is part of a Separator. To treat spaces that are not part of a Separator as comments, include the canned description SPACES as a comment specification:

Separator:  $\040+#\040+
            SPACES

Note that only a white space character that appears at the beginning of a pattern loses its default interpretation in this way. In this example, neither the tab nor the newline appeared at the beginning of a pattern and therefore tabs and newlines continue to be treated as comments.
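
The effect of the Separator example can be imitated with a toy scanner in Python. This is a sketch under stated assumptions, not a description of the scanner Eli generates: it assumes a longest-match rule, the token name Word and the function scan are hypothetical, and the rule "a pattern beginning with a space removes the default treatment of spaces" is applied by hand when the comment list is built.

```python
import re


def scan(text, tokens, comments):
    """Toy longest-match scanner.

    tokens:   list of (name, regex) pairs that produce output
    comments: list of regexes whose matches are skipped
    """
    pos, result = 0, []
    while pos < len(text):
        best_len, best_name = 0, None
        candidates = list(tokens) + [(None, pat) for pat in comments]
        for name, pat in candidates:
            m = re.compile(pat).match(text, pos)
            if m and m.end() - pos > best_len:
                best_len, best_name = m.end() - pos, name
        if best_len == 0:
            raise ValueError(f"no pattern accepts {text[pos]!r} at position {pos}")
        if best_name is not None:
            result.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return result


tokens = [("Word", r"[a-z]+"), ("Separator", r"\040+#\040+")]
# Tabs and newlines keep their default treatment as comments; spaces do not,
# because Separator begins with a space.
default_comments = [r"\t", r"\n"]

print(scan("a # b", tokens, default_comments))
try:
    scan("a b", tokens, default_comments)
except ValueError as err:
    print("rejected:", err)

# Adding a SPACES-like comment pattern restores skipping of stray spaces.
with_spaces = default_comments + [r"\040+"]
print(scan("a b", tokens, with_spaces))
print(scan("a # b", tokens, with_spaces))
```

With the SPACES-like pattern present, the longer Separator match still wins over a single skipped space, so "a # b" continues to yield a Separator token.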

C_IDENTIFIER, C_INTEGER, C_INT_DENOTATION, C_FLOAT, C_STRING_LIT, C_CHAR_CONSTANT, C_COMMENT
Identifiers, integer constants (both C_INTEGER and C_INT_DENOTATION), floating point constants, string literals, character literals, and comments from the C programming language, respectively.

C_INTEGER does not permit the L or U flags, but does correctly accept all other C integer denotations. By default, it uses c_mkint to convert the denotation to an internal int value. c_mkint obeys the C rules for determining the radix of the conversion.

C_INT_DENOTATION accepts all valid ANSI C integer denotations. By default, it uses mkstr to deliver a unique string table index for every occurrence of a denotation. This behavior is often overridden by adding [mkidn]:

Integer:  C_INT_DENOTATION [mkidn]

In this case, two identical denotations will have the same string table index.

C_IDENTIFIER_ISO
Character sequences obeying the definition of a C identifier, but accepting all ISO/IEC 8859-1 letters. Care must be taken in using this description because these identifiers are not acceptable to most C compilers. That means they cannot usually be used as (parts of) identifiers in generated code.

PASCAL_IDENTIFIER, PASCAL_INTEGER, PASCAL_REAL, PASCAL_STRING, PASCAL_COMMENT
Identifiers, integer constants, real constants, string literals, and comments from the Pascal programming language, respectively.

MODULA2_INTEGER, MODULA2_CHARINT, MODULA2_LITERALDQ, MODULA2_LITERALSQ, MODULA2_COMMENT
Integer constants, characters specified using character codes, string literals delimited by double and single quotes, and comments from the Modula-2 programming language, respectively.

MODULA3_COMMENT
Comments from the Modula-3 programming language.

ADA_IDENTIFIER, ADA_COMMENT
Identifiers and comments from the Ada programming language.

AWK_COMMENT
Comments from the AWK programming language.

SPACES
Sequence of one or more spaces.

TAB
A single horizontal tab.

NEW_LINE
A single newline.
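
Returning to C_INTEGER: the statement that c_mkint obeys the C rules for determining the radix can be illustrated in Python. The function c_integer_value below is hypothetical and only mirrors the C language rule (a 0x or 0X prefix means hexadecimal, a leading 0 means octal, anything else is decimal); it is not the source of c_mkint.

```python
def c_integer_value(denotation):
    """Convert a C integer denotation without L or U flags to an int,
    choosing the radix from the prefix as the C language does."""
    if denotation[:2] in ("0x", "0X"):
        return int(denotation[2:], 16)   # hexadecimal
    if len(denotation) > 1 and denotation[0] == "0":
        return int(denotation, 8)        # octal
    return int(denotation, 10)           # decimal


print(c_integer_value("255"))   # 255
print(c_integer_value("0377"))  # 255
print(c_integer_value("0xFF"))  # 255
```

Inputs such as 089 or a bare 0x raise ValueError in this sketch. How c_mkint treats such text is not stated here, so the sketch makes no claim about it.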

Definitions of Canned Descriptions

Eli textually replaces a reference to a canned description with its definition. If a user nominates an auxiliary scanner and/or a token processor for a canned description, that overrides the corresponding nomination appearing in the definition of the canned description.
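
The replacement and override rule can be modelled in a few lines of Python. This is a sketch only: the table CANNED and the function expand are hypothetical names, and the two entries are copied from the definitions listed in this section.

```python
# name -> (regular expression, auxiliary scanner or None, token processor or None)
CANNED = {
    "PASCAL_STRING": ("$'", "auxPascalString", "mkstr"),
    "PASCAL_INTEGER": ("$[0-9]+", None, "mkint"),
}


def expand(name, scanner=None, processor=None):
    """Return the specification text that replaces a canned description.

    A scanner or processor nominated by the user takes the place of the
    corresponding nomination in the canned definition.
    """
    regex, default_scanner, default_processor = CANNED[name]
    scanner = scanner or default_scanner
    processor = processor or default_processor
    parts = [regex]
    if scanner:
        parts.append(f"({scanner})")
    if processor:
        parts.append(f"[{processor}]")
    return " ".join(parts)


print(expand("PASCAL_STRING"))                     # $' (auxPascalString) [mkstr]
print(expand("PASCAL_STRING", processor="mkidn"))  # $' (auxPascalString) [mkidn]
```

Each nomination is overridden independently, so replacing the token processor leaves the auxiliary scanner of the canned definition in place.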

The following is an alphabetized list of the canned descriptions available in the Eli library, with their definitions. Use this list as a formal definition, and as an example for constructing specifications. (C_FLOAT and PASCAL_REAL have definitions that are too long to fit on one line of this document. Each is, however, a single line in the specification file.)

ADA_COMMENT
$-- (auxEOL)
ADA_IDENTIFIER
$[a-zA-Z](_?[a-zA-Z0-9])* [mkidn]
AWK_COMMENT
$# (auxEOL)
C_COMMENT
$"/*" (auxCComment)
C_CHAR_CONSTANT
$' (auxCChar) [c_mkchar]
C_FLOAT
$((([0-9]+\.[0-9]*|\.[0-9]+)((e|E)(\+|-)?[0-9]+)?)|([0-9]+(e|E)(\+|-)?[0-9]+))[fFlL]? [mkstr]
C_IDENTIFIER
$[a-zA-Z_][a-zA-Z_0-9]* [mkidn]
C_INTEGER
$([0-9]+|0[xX][0-9a-fA-F]*) [c_mkint]
C_INT_DENOTATION
$([1-9][0-9]*|0[0-7]*|0[xX][0-9a-fA-F]+)([uU][lL]?|[lL][uU]?)? [mkstr]
C_STRING_LIT
$\" (auxCString) [mkstr]
MODULA_INTEGER
$[0-9][0-9A-Fa-f]*[BCH]? [modula_mkint]
MODULA2_COMMENT, MODULA3_COMMENT
$\(\* (auxM3Comment)
MODULA2_CHARINT
$[0-9][0-9A-Fa-f]*C [modula_mkint]
MODULA2_INTEGER
$[0-9][0-9A-Fa-f]*[BH]? [modula_mkint]
MODULA2_LITERALDQ
$\" (auxM2String) [mkstr]
MODULA2_LITERALSQ
$\' (auxM2String) [mkstr]
PASCAL_COMMENT
$"{"|"(*" (auxPascalComment)
PASCAL_IDENTIFIER
$[a-zA-Z][a-zA-Z0-9]* [mkidn]
PASCAL_INTEGER
$[0-9]+ [mkint]
PASCAL_REAL
$(([0-9]+\.[0-9]+)((e|E)(\+|-)?[0-9]+)?)|([0-9]+(e|E)(\+|-)?[0-9]+) [mkstr]
PASCAL_STRING
$' (auxPascalString) [mkstr]
SPACES
$\040+
TAB
$\t (auxTab)
NEW_LINE
$[\r\n] (auxNewLine)
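
To get a feel for what some of these definitions accept, their regular expressions can be tried out in Python. This is a rough check rather than a statement about Eli's scanner: it assumes the constructs used here mean the same in Python's re module as in a type-`gla' file, the leading $ is dropped, and the helper accepts is hypothetical.

```python
import re

# Regular expressions transcribed from the list above (leading "$" dropped).
C_INTEGER = r"([0-9]+|0[xX][0-9a-fA-F]*)"
C_INT_DENOTATION = r"([1-9][0-9]*|0[0-7]*|0[xX][0-9a-fA-F]+)([uU][lL]?|[lL][uU]?)?"
ADA_IDENTIFIER = r"[a-zA-Z](_?[a-zA-Z0-9])*"
PASCAL_REAL = r"(([0-9]+\.[0-9]+)((e|E)(\+|-)?[0-9]+)?)|([0-9]+(e|E)(\+|-)?[0-9]+)"


def accepts(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


samples = [
    ("C_INTEGER", C_INTEGER, "42UL"),
    ("C_INT_DENOTATION", C_INT_DENOTATION, "42UL"),
    ("C_INTEGER", C_INTEGER, "089"),
    ("C_INT_DENOTATION", C_INT_DENOTATION, "089"),
    ("ADA_IDENTIFIER", ADA_IDENTIFIER, "max_value"),
    ("ADA_IDENTIFIER", ADA_IDENTIFIER, "max__value"),
    ("PASCAL_REAL", PASCAL_REAL, "3.14"),
    ("PASCAL_REAL", PASCAL_REAL, "3."),
]
for name, pattern, text in samples:
    print(f"{name:17} {text:12} {accepts(pattern, text)}")
```

The samples show the differences discussed earlier: only C_INT_DENOTATION accepts the U and L flags, C_INTEGER is more permissive about digits after a leading 0, and ADA_IDENTIFIER rejects doubled underscores.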

