Lexical Analysis

Spaces, Tabs and Newlines

An Eli-generated processor examines its input text sequentially, recognizing character sequences in the order in which they appear. At each point it matches the longest possible sequence, classifies that sequence, and then begins anew with the next character. If the first character of a sequence is a space, tab or newline then the default behavior is to classify the sequence consisting of that character and all succeeding spaces, tabs and newlines as a comment. This behavior is consistent with the definitions of most programming languages, and is reasonable in a large fraction of text processing tasks.

Even though tabs and newlines are considered comments by default, some processing is needed to account for their effect on the source text position. Eli-generated processors define a two-dimensional coordinate system (line number and column index), which they use to link error reports to the source text (see "Source Text Coordinates and Error Reporting" in the Library Reference Manual).

White space may be significant in two situations:

  1. Within a character sequence, such as spaces in a string
  2. On its own, such as line boundaries in a type-`gla' file

Appropriate white space may be specified as part of the description of a complete character sequence (provided that it is not at the beginning) without disrupting the default behavior. (Coordinate processing for tabs and newlines must be provided if they are allowed within the sequence.) The default behavior is overridden, however, by any specification of white space on its own or at the beginning of another character sequence. Overriding is specific to the white space character used: a specification of new behavior for a space overrides the default behavior for a space, but not the default behavior for a tab or newline.

The following sections explain how coordinate processing is provided for newlines and tabs, and how to re-establish default behavior of white space on its own when white space can occur at the beginning of another character sequence.

Maintaining the Source Text Coordinates

The raw data for determining coordinates are two variables: LineNum, an integer variable exported by the error module (see "Source Text Coordinates and Error Reporting" in the Library Reference Manual), and StartLine, a character pointer exported by the lexical analyzer. The following invariant must be maintained on these variables:

LineNum = cumulative index of the current line in the input text
(pointer to current character) - StartLine = index of the current character
    in the current line

This invariant must hold whenever the lexical analyzer begins to process a character sequence. It may be destroyed during the processing of that sequence, but must be re-established before processing of the next character sequence begins.

LineNum is initially 1, and must be incremented each time the lexical analyzer advances beyond a newline character in the input text. At the beginning of each line, StartLine must be set to point to the character position preceding the first character of that line. As the current character pointer is advanced, the condition on StartLine is maintained automatically unless the character pointer advances over a tab character.

A tab character in the input text represents one or more spaces, depending upon its position relative to the next tab stop, but it occupies only one character position. If the tab represents n spaces, n-1 must be subtracted from StartLine to maintain the invariant.

Because the value of n depends upon the index of the current character and the settings of the tab stops in the line, Eli provides an operation TABSIZE(i) (defined in file `tabsize.h') to compute the adjustment. The argument i is the index in the current line of the character position beyond that containing the tab, and the result is the number of spaces that must be added to reach the next tab stop from that position (that is, the n-1 to be subtracted from StartLine).

Suppose that p is a pointer to the current input character. Here is a code sequence that maintains the condition on StartLine when a tab is encountered:

#include "tabsize.h"
...
  /* p is advanced first, so p - StartLine is the index of the position beyond the tab */
  if ((*p++) == '\t') StartLine -= TABSIZE(p - StartLine);
...
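The bookkeeping described above can be illustrated with a small stand-alone sketch. It is not Eli's implementation. The names sketch_tabsize, Coord and coord_of are hypothetical. The sketch assumes 1-based column indices (as implied by the invariant) and tab stops at columns 1, 9, 17 and so on. It uses integer offsets in place of the StartLine pointer so that the example stays within defined C behavior.

```c
#include <stddef.h>

/* Hypothetical stand-in for TABSIZE (assumption: tab stops at columns
   1, 9, 17, ...).  i is the index of the position just beyond the tab;
   the result is the number of extra columns the tab contributes. */
static int sketch_tabsize(long i)
{
    return (int)(7 - ((i - 2) % 8));
}

typedef struct {
    int line;   /* plays the role of LineNum */
    long col;   /* (current position) - start_line */
} Coord;

/* Compute the coordinates of text[target] by scanning from the start. */
static Coord coord_of(const char *text, size_t target)
{
    int line_num = 1;      /* LineNum is initially 1 */
    long start_line = -1;  /* position preceding the first character */
    long p = 0;            /* offset of the current character */
    Coord result;

    while ((size_t)p < target) {
        char c = text[p++];
        if (c == '\t') {
            /* a tab occupies one position but may stand for several columns */
            start_line -= sketch_tabsize(p - start_line);
        } else if (c == '\n') {
            /* advanced beyond a newline: count it, restart the column index */
            line_num++;
            start_line = p - 1;
        }
    }
    result.line = line_num;
    result.col = p - start_line;
    return result;
}
```

Under these assumptions, coord_of("ab\tc", 3) yields line 1, column 9: the tab at column 3 moves the following character to the next tab stop.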

TABSIZE defines the positions of the tab stops. The default implementation provides tab stops every 8 character positions. A user can change this default by supplying a new version of the Eli library routine TabSize. The source code for the library version of this routine can be obtained by making the following request:

-> $elipkg/gla/tabsize.c > MyTabSize.c

After modifying the routine appropriately, add the name MyTabSize.c to your type-`specs' file.

The coordinate invariant is maintained automatically if no patterns matching tabs or newline characters are defined, and no auxiliary scanners that advance over tabs or newline characters are provided by the user. If such patterns or scanners are needed, then the user must define them in such a way that they maintain the coordinate invariant.

Three auxiliary scanners (coordAdjust, auxTab and auxNewLine) are available to maintain the coordinate invariant for a regular expression that matches tabs or newline characters (see Available scanners). While these auxiliary scanners could be invoked by user-defined auxiliary scanners that advance over tabs or newline characters, it is often simpler to include the appropriate code to maintain the coordinate invariant.

For an example of the use of code in an auxiliary scanner to maintain the coordinate invariant, see the library version of auxNUL.
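As a rough illustration of including such code directly, here is a sketch of a user-written auxiliary scanner that advances to the end of a C-style comment while maintaining the invariant. Several things here are assumptions rather than facts from this section: the scanner interface (a start pointer and matched length in, a pointer beyond the sequence out), the name auxSketchComment, and the macro SKETCH_TABSIZE, which stands in for TABSIZE with tab stops assumed at columns 1, 9, 17 and so on. LineNum and StartLine are declared locally only to keep the sketch self-contained; in a real processor they are the variables exported by the error module and the lexical analyzer.

```c
/* Local stand-ins (assumption) for the variables exported by Eli. */
static int LineNum = 1;
static char *StartLine;

/* Hypothetical stand-in for TABSIZE: tab stops at columns 1, 9, 17, ... */
#define SKETCH_TABSIZE(i) (7 - (((i) - 2) % 8))

/* Hypothetical auxiliary scanner: start points to the matched comment
   opener, length is the length of that match.  Returns a pointer to the
   first character beyond the closing delimiter, or to the terminating
   NUL if the comment is not closed. */
char *auxSketchComment(char *start, int length)
{
    char *p = start + length;

    while (*p != '\0') {
        char c = *p++;
        if (c == '\t') {
            /* keep (p - StartLine) equal to the column index */
            StartLine -= SKETCH_TABSIZE(p - StartLine);
        } else if (c == '\n') {
            /* advanced beyond a newline */
            LineNum++;
            StartLine = p - 1;
        } else if (c == '*' && *p == '/') {
            return p + 1;
        }
    }
    return p;
}
```

When the scanner returns, the invariant holds again for the character it points to, which is the condition required before the next character sequence is processed.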

Restoring the Default Behavior for White Space

When a pattern beginning with a space, tab or newline character overrides the default behavior for that character, the character will only be accepted as part of an explicit pattern. The default behavior can be restored by using one of the canned descriptions SPACES, TAB or NEW_LINE respectively (see Available Descriptions):

Define:  $\040+define
         SPACES

Here the pattern for Define overrides the default behavior for space characters. If this were the only specification, spaces in the input text would only be accepted if they occurred immediately before the character sequence define. By adding the canned description SPACES, and classifying the sequences it matches as comments, the default behavior is restored.

Note that this specification is ambiguous: A sequence of spaces followed by define could either match the Define pattern or the spaces alone could be classified as the comment specified by SPACES. The principle of the longest match guarantees that in this case the sequence will be classified as Define (see What happens if the specification is ambiguous?).
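The effect of the longest-match principle on this specification can be sketched in a few lines of C. This is a hand-written model of the two patterns, not the scanner Eli generates. The names match_spaces, match_define, TokKind and classify are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Models SPACES: length of a run of one or more spaces, 0 if none. */
static size_t match_spaces(const char *s)
{
    size_t n = 0;
    while (s[n] == ' ') n++;
    return n;
}

/* Models $\040+define: one or more spaces followed by "define";
   returns the matched length, or 0 if the pattern does not match. */
static size_t match_define(const char *s)
{
    size_t n = match_spaces(s);
    if (n == 0 || strncmp(s + n, "define", 6) != 0) return 0;
    return n + 6;
}

typedef enum { TOK_NONE, TOK_COMMENT, TOK_DEFINE } TokKind;

/* Classify the sequence at s by choosing the longest match. */
static TokKind classify(const char *s, size_t *len)
{
    size_t sp = match_spaces(s);
    size_t df = match_define(s);

    if (df > sp) { *len = df; return TOK_DEFINE; }
    if (sp > 0)  { *len = sp; return TOK_COMMENT; }
    *len = 0;
    return TOK_NONE;
}
```

For the input "   define x" both patterns match, but the Define match (9 characters) is longer than the SPACES match (3 characters), so the sequence is classified as Define. For "   other" only SPACES matches, and the three spaces are classified as a comment.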

Making White Space Illegal

When white space is illegal at the beginning of a pattern, the default treatment of white space must be overridden with an explicit comment pattern. Because the sequence is specified to be a comment, nothing will be returned to the parser. A token processor like lexerr can be used to report the error:

         SPACES  [lexerr]

The canned descriptions SPACES, TAB and NEW_LINE should be used as patterns in such specifications because they handle all of the coordinate updating (see Maintaining the Source Text Coordinates).

