Lexical Analysis



Case Insensitivity

The default behavior of an Eli-generated lexical analyzer is to treat each ASCII character as an entity distinct from all other ASCII characters. This behavior is inappropriate for applications that do not distinguish upper-case letters from lower-case letters in certain contexts. For example, a Pascal compiler ignores the case of letters in identifiers and keywords, but distinguishes them in strings. Thus the Pascal identifiers MyId, MYID and myid are identical but the strings 'MyString', 'MYSTRING' and 'mystring' are different.

Case insensitivity is reflected in the identity of character sequences. In other words, the character sequences MyId, MYID and myid are considered to be identical character sequences if and only if the generated processor is insensitive to the case of letters. Two character sequences are identical as far as the remainder of the processor is concerned if they have the same classification and their values are equal (see Specifications). Since the classification and value are determined by the token processor, it is the token processor that must implement case insensitivity.

Two conditions must be met if a processor is to be insensitive to case:

  1. A token processor that maintains a table of character sequences in which all letters are of one case must be available.

  2. The specification of each case-insensitive character sequence must invoke such a token processor.

A Case-Insensitive Token Processor

The token processor mkidn maintains a table of character sequences and provides the same classification and value for identical character sequences. Normally, mkidn treats upper-case letters and lower-case letters as different characters. This behavior is controlled by an exported variable, dofold (see Unique Identifier Management of Library Reference Manual). When dofold=0, character sequences are entered into the table exactly as they are passed to mkidn; otherwise, all letters in the sequence are converted to upper case before the sequence is entered into the table.

Although the value of dofold could be altered on the basis of context by user-defined code, it is normally constant throughout the processor's execution. To generate a processor in which dofold=1, specify the parameter +fold in the request (see fold -- Make the Processor Case-Insensitive of Products and Parameters Reference Manual). If this parameter is not specified in the request, Eli will produce a processor with dofold=0.

The value set by mkidn is the (unique) index of the transformed character sequence in the table. Thus, if dofold is nonzero and that value is later used to retrieve the sequence, the result will be the original sequence with all lower-case letters replaced by their upper-case equivalents.
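The effect of dofold on the table can be illustrated with a small model. The following Python sketch is not Eli's implementation: the class SequenceTable, the helper fold_ascii, the method names and the classification code IDENT are all hypothetical, and the real mkidn is a C routine with a different interface. The sketch only mimics the behavior described above, assuming that folding applies to ASCII letters.

```python
def fold_ascii(text):
    """Replace ASCII lower-case letters by their upper-case equivalents."""
    return "".join(c.upper() if "a" <= c <= "z" else c for c in text)


class SequenceTable:
    """Toy model of a table of unique character sequences (not Eli's code)."""

    def __init__(self, dofold=0):
        self.dofold = dofold   # 0: enter sequences as given; nonzero: fold first
        self.sequences = []    # value (index) -> stored sequence
        self.entries = {}      # stored sequence -> (classification, value)

    def mkidn(self, text, classification):
        """Return (classification, value); identical sequences get identical results."""
        key = fold_ascii(text) if self.dofold else text
        if key not in self.entries:
            self.sequences.append(key)
            self.entries[key] = (classification, len(self.sequences) - 1)
        return self.entries[key]

    def sequence(self, value):
        """Retrieve the stored (possibly folded) sequence for a value."""
        return self.sequences[value]


IDENT = 1  # hypothetical classification code

folding = SequenceTable(dofold=1)
results = [folding.mkidn(s, IDENT) for s in ("MyId", "MYID", "myid")]
print(results)                          # three identical (classification, value) pairs
print(folding.sequence(results[0][1]))  # the folded sequence, not the original spelling
```

With dofold=0 the same three calls would create three separate table entries, so the remainder of the processor would see three different identifiers.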

Making Literal Symbols Case Insensitive

Since literal symbols are recognized exactly as they stand in the grammar, they are case sensitive by definition. For example, if a grammar for Pascal contains the literal symbol 'begin' then the generated processor will recognize only the character sequence begin as an instance of that literal symbol. This behavior could be changed by redefining the literal symbol as a nonliteral symbol (say) BEGIN, and providing the following specification in a type-`gla' file:

BEGIN:  $[Bb][Ee][Gg][Ii][Nn]  [mkidn]

If many literal symbols are to be treated as case-insensitive, this approach is tedious and error-prone. It also distorts the grammar by converting literal terminal symbols to nonliteral terminal symbols.
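To see why the manual approach scales poorly, consider how the character-class form of each keyword would have to be produced. The Python sketch below is only an illustration: the function case_insensitive_pattern is hypothetical and is not part of Eli, and the pattern it builds is checked with Python's re module rather than with a generated lexical analyzer. For a plain alphabetic keyword, the result has the same shape as the bracketed part of the BEGIN specification.

```python
import re


def case_insensitive_pattern(keyword):
    """Expand each ASCII letter into a two-character class, e.g. b -> [Bb]."""
    parts = []
    for ch in keyword:
        if ch.isascii() and ch.isalpha():
            parts.append("[" + ch.upper() + ch.lower() + "]")
        else:
            parts.append(re.escape(ch))  # keep non-letters literal
    return "".join(parts)


pattern = case_insensitive_pattern("begin")
print(pattern)
print(bool(re.fullmatch(pattern, "BeGiN")))
```

Every keyword in the grammar would need such an expansion, plus a corresponding nonliteral symbol, which is exactly the bookkeeping the approach described next avoids.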

To solve this problem, Eli allows the user to specify a set of literal symbols that should be placed into the table used by mkidn, with their classification codes, at the time the generated lexical analyzer is loaded. If the +fold parameter is also specified, all lower-case letters in these symbols will be replaced by their upper-case equivalents before the symbol is placed into the table. The desired behavior is then obtained by invoking mkidn after recognizing the appropriate character sequence in the input text.

The set of literal symbols to be placed into the table is specified by giving a sequence of regular expressions in a type-`gla' file, and then deriving the :kwd product from that file (see kwd -- Recognize Specified Literals as Identifiers of Products and Parameters Reference Manual). The regular expressions describe the form of the literal symbols in the grammar, not the input character sequences to be recognized.

Suppose, for example, that a Pascal grammar specified all keywords as literal symbols made up of lower-case letters:

Statement:
  ...
  'while' Expression 'do' Statement /
  ...

A type-`gla' file describing the form these symbols take in the grammar would consist of the single line $[a-z]+. If the name of that file was `PascalKey.gla' then the user could tell Eli to initialize mkidn's table with all of the keywords by including the following line in a type-`specs' file:

PascalKey.gla :kwd

In Pascal, keywords have the form of identifiers in the input text. Therefore the canned description PASCAL_IDENTIFIER suffices to recognize both identifiers and keywords. PASCAL_IDENTIFIER invokes mkidn to obtain the classification and value of the sequence recognized by the regular expression $[a-zA-Z][a-zA-Z0-9]*. Since mkidn's table has been initialized with the character sequences for the literal keyword symbols, and their classifications, they will be appropriately recognized.

The :kwd product and the +fold parameter are independent of one another. Thus, to make the generated lexical analyzer accept Pascal keywords in arbitrary case, the user must both provide the :kwd specification and derive with the +fold parameter.
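The interaction of the two mechanisms can be sketched with a small model. This is a simplified illustration, not Eli's code: make_scanner, fold_ascii and the classification codes are hypothetical, the keywords argument stands in for the table initialization performed by :kwd, and the fold argument stands in for +fold. The model returns only the classification and does not cover derivations without :kwd.

```python
import re

IDENT, KEYWORD_WHILE, KEYWORD_DO = 1, 2, 3  # hypothetical classification codes
KEYWORDS = {"while": KEYWORD_WHILE, "do": KEYWORD_DO}

# Same form as the regular expression behind PASCAL_IDENTIFIER.
IDENTIFIER_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def fold_ascii(text):
    """Replace ASCII lower-case letters by their upper-case equivalents."""
    return "".join(c.upper() if "a" <= c <= "z" else c for c in text)


def make_scanner(keywords, fold):
    """Return classify(word) over a table preloaded with the keyword literals."""
    norm = fold_ascii if fold else (lambda s: s)
    # Preload at "load time": literals are folded too when fold is set.
    table = {norm(literal): code for literal, code in keywords.items()}

    def classify(word):
        if not IDENTIFIER_RE.fullmatch(word):
            raise ValueError("not of identifier form: " + repr(word))
        # Known sequences keep their code; new ones are entered as identifiers.
        return table.setdefault(norm(word), IDENT)

    return classify


for fold in (False, True):
    classify = make_scanner(KEYWORDS, fold)
    print(fold, classify("while"), classify("WHILE"), classify("counter"))
```

In this model, with the table preloaded but no folding, only the exact spelling while receives the keyword classification and WHILE is entered as an ordinary identifier. With both mechanisms active, every spelling maps to the same folded table entry and therefore to the keyword classification.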

