Guide for New Eli Users
The program generated by Eli
reads a file containing the text, examining it character-by-character.
Character sequences are recognized as significant units or discarded, and
relationships among the sequences are used to build a tree that
reflects the structure of the text.
Computations are then carried out over the tree, and the results of these
computations determine the processor's output.
Thus Eli assumes that the original text processing problem is decomposed into
the problems of determining
which character sequences are significant and which should be discarded,
what tree structure should be imposed on significant sequences,
what computations should be carried out over the resulting tree, and
how the computed values should be encoded and output.
A user describes a particular text processing problem to Eli
by specifying the characteristics of its subproblems.
For example, the tree structure that should be imposed on significant
character sequences is specified to Eli by providing a context-free
grammar.
Different subproblems are characterized in different ways, and they are
specified by descriptions written in different languages.
Those specifications are processed by a variety of tools,
which verify their consistency and generate C-compatible C++ code
to solve the specified subproblems.
Eli focuses the user's attention on specification by
automating the process of tool invocation.
It responds to a request for a piece of computed information, called a
derived object, by invoking the minimum number of tools necessary to
produce that information.
Derived objects are automatically stored by Eli for later re-use,
significantly improving the response time for subsequent requests.
This chapter provides
an overview of the most important subproblems and how they can be described,
summarizes the available descriptive mechanisms and explains the most
common derived objects,
and indicates how Eli is used to create and test programs.
There is considerable variability in the specific decompositions
for particular text processing problems.
For example, one subproblem
of the compilation problem for many programming
languages is overload resolution:
how to decide that (say) a
particular + operator means integer addition instead of real addition.
Overload resolution would probably not be a subproblem of the problem of
creating a PostScript version of a musical score from a description of the
desired notes.
This section briefly reviews five major subproblems common to a wide range of
problems:
- Syntactic analysis
- Determining the structure of an input text
- Lexical analysis
- Recognizing character sequences
- Attribution
- Computing values over trees
- Property storage
- Maintaining information about entities
- Text generation
- Producing structured output
Syntactic analysis
determines the phrase structure
of the input text.
The phrase structure is described to Eli by a context-free grammar.
Here is a context-free grammar
describing the structure of a geographical position
(see Context-Free Grammars and Parsing of Syntactic Analysis):
Position: Latitude Longitude .
Latitude: NS Coordinate .
Longitude: EW Coordinate .
NS: 'N' / 'S' .
EW: 'E' / 'W' .
Coordinate: Integer Float .
This grammar has eight symbols
(Position , Latitude , Longitude ,
NS , EW , Coordinate Integer
and Float ),
four literals
('N' , 'S' , 'E' and 'W' ) and
eight rules
(x: y / z . is an abbreviation for the two rules
x: y . and x: z . ).
Two of the symbols, Integer and Float , are not defined by the
grammar.
One of the symbols, Position , is not used in the definition of any
other symbol.
Symbols that are not defined by the grammar are called
terminal symbols;
those defined by the grammar are called
nonterminal symbols.
The (unique) symbol not used in the definition of any other symbol is
called the axiom
of the grammar.
The entire input text is called a sentence,
and corresponds to the axiom.
Thus the input text N41 58.8 W087 54.3 is a sentence corresponding
to Position .
A phrase
is a portion of the sentence corresponding to a nonterminal symbol
of the grammar.
For example, N41 58.8 is a phrase corresponding to Latitude ,
and 087 54.3 is a phrase corresponding to Coordinate .
54 and 58.8 W087 are not phrases
because they do not correspond to any symbols.
(54 is a part of the phrase 087 54.3 corresponding to
Coordinate and
58.8 W087 is made up of part of the phrase corresponding to
Latitude and part of the phrase corresponding to Longitude .)
Lexical analysis
is the process that examines the source program text,
retaining significant character sequences
and discarding those that are not significant.
A character sequence is significant if it corresponds to a literal in the
grammar or to a terminal symbol.
In the sentence N41 58.8 W087 54.3 , the significant character sequences
are N , 41 , 58.8 , W , 087 and 54.3 .
The spaces separating the numbers of a Coordinate
and preceding the Longitude are not significant.
Eli obtains information about character sequences that correspond to literals
directly from the grammar.
The user must, however, provide descriptions of the character sequences
corresponding to each terminal symbol.
Those descriptions determine the form of the character sequences,
but not their content.
For example, the character sequences corresponding to the terminal symbol
Integer might be described as sequences of one or more decimal
digits, and those corresponding to the terminal symbol Float might be
described as pairs of such sequences separated by a dot
(see Regular Expressions of Lexical Analysis):
Integer: $[0-9]+
Float: $[0-9]+"."[0-9]+
Computations
may be specified over a tree that reflects the phrase structure of a sentence.
That tree has one node
corresponding to each phrase, with the root
corresponding to the sentence.
For example, the tree that reflects the structure of the sentence
N41 58.8 W087 54.3 has a node corresponding to the
Latitude phrase N41 58.8 .
Children
of a node correspond to the immediate component phrases of the
phrase corresponding to that node.
Thus the children of the the node corresponding to the
Latitude phrase N41 58.8
would correspond to the phrases N and 41 58.8 , because the
immediate components of the Latitude phrase N41 58.8 are an
NS phrase (N ) and a Coordinate phrase (41 58.8 ).
One or more values, called attributes,
may be associated with each tree node.
Computations over the tree may involve only attribute access,
C constants,
and function application.
Here is an example:
RULE DegMin: Coordinate ::= Integer Float
COMPUTE
Coordinate.minutes=Minutes(Integer, Float);
END;
According to the rule DegMin , Coordinate has a minutes
attribute attached to it.
The minutes attribute of Coordinate is computed by applying
the function Minutes to the values representing
Integer and Float , but how are those values obtained?
Clearly they must depend on the character sequences corresponding to these
terminal symbols.
A single integer value can be used to represent any terminal symbol.
That value is determined by a token processor
whose name is attached to
the definition of the character sequences corresponding to the symbol,
enclosed in brackets (see Token Processors of Lexical Analysis):
Integer: $[0-9]+ [mkstr]
Float: $[0-9]+"."[0-9]+ [mkstr]
The processor mkstr
is a library routine that stores the character sequence in an array
and yields the sequence's index.
Minutes can then use the index values to obtain the stored strings,
convert them to numbers, and perform the necessary computation.
Often an input text describes some set of entities
and the relationships
among them.
For example, a program in a conventional language may describe some set of
constants, variables, types and procedures, and how these entities are
related in the execution of an algorithm.
Entities may be defined by one part of the input text and used by another.
It is therefore necessary for computations over the tree representing the
phrase structure of a sentence to be able to refer to entities and their
properties at arbitrary points.
Eli provides a definition table
to meet this requirement
(see Definition Table Design Criteria of Definition Table).
Each entity can be represented by a unique definition table key,
which allows access to arbitrary information about that entity.
The Eli user specifies the information that might be stored, and possibly a
set of query
and update
operations for that information.
(Eli provides a standard query operation and a standard update operation
that suffice for most purposes.)
Library modules are available for associating unique definition table keys
with identifiers
according to the scope rules
of the input text,
and for maintaining various kinds of information about entities.
Definition table keys themselves can be stored as attributes,
compared for equality, passed to query and update routines,
and accessed either directly or remotely.
A distinguished value, NoKey ,
represents the absence of a definition table key.
Output
may be derived from arbitrary information in the input text.
The elements of the output may be arranged in arbitrary ways based on
computations over the tree representing the phrase structure of a sentence.
It is therefore useful to be able to build up a complex output data
structure piecemeal, combining components according to information gleaned
from computation.
The program text generation
facility allows the user to specify templates
that describe output text fragments
(see Pattern Specifications of PTG: Pattern-Based Text Generator).
"Holes" in these templates can be filled with text generated according to
other templates.
The result is a directed, acyclic graph in which each node represents a
single template and the children of that node represent generated text
fragments.
Text at the leaves can be generated by arbitrary user-supplied routines.
(A library module provides common leaf generation routines.)
Eli generates a set of functions,
one per template, that are invoked during
computations to build the directed, acyclic graph.
These functions return values that can be stored as attributes,
passed to text generation functions,
and accessed either directly or remotely.
A distinguished value, PTGNULL ,
represents the absence of a graph.
Printing functions
are also provided by Eli to output the generated text on
an arbitrary file (including the standard output unit).
These functions accept a graph and perform a depth-first, left-to-right
traversal.
The text is output during the traversal, with text generated by a common
subgraph being output once for each parent of that subgraph.
The Eli user describes the subproblems
of a particular text processing problem.
Eli derives code fragments from these descriptions, and combines them into
a complete program.
Each description is given in a notation that is ideally suited to the
subproblem being described.
Subproblem descriptions are placed into files,
each of which has a type.
The type is indicated by the file name extension: `foo.c' is a
type-`c' file.
Eli recognizes type-`c' and type-`h' files as C program text and
include files respectively.
Here is a list of the other file types recognized by Eli:
- `specs'
- A collection of object names, one per line.
- `con'
- A description of the phrase structure
of the input text.
Each phrase may be associated with a computation to be carried out when
that phrase is recognized.
Eli generates a parser from these specifications.
See Syntactic Analysis.
- `gla'
- A description of character sequences,
whether they are meaningful or not,
and what (if any) computation should be carried out when they are
recognized in the input text.
Eli generates a scanner from these specifications.
See Lexical Analysis.
- `lido'
- A description of the structure of a tree and the computations
to be carried out on that tree.
Eli generates a tree-walking evaluator from these specifications.
For a discussion on constructing computations in trees, see
LIDO - Computation in Trees.
For reference, see LIDO - Reference Manual.
- `ctl'
- Constraints on evaluator generation.
Eli uses these specifications to modify its behavior when constructing the
routine that carries out the computations.
See LIGA Control Language Reference Manual.
- `ptg'
- A description of structured output text.
Eli generates a set of output functions from these specifications.
See Pattern-Based Text Generator.
- `pdl'
- A definition of entities and their associated properties.
Eli generates a definition table module from these specifications.
See PDL Reference Manual.
- `oil'
- A definition of possible tree node re-labeling.
Eli generates a re-labeling module from these specifications.
See OIL Reference Manual.
- `clp'
- A description of the meanings of command line arguments.
Eli generates a module that accesses command line arguments from these
specifications.
See CLP Reference Manual.
- `map'
- A description of the relationship between the phrase structure
of the input text and the structure of the tree
over which computations are to be made.
Eli uses this specification to determine the tree building actions that
must be attached to rules of the parsing grammar.
See Mapping of Syntactic Analysis.
- `sym'
- This is provided for backward compatibility with previous Eli releases
for specifying symbolic equivalence classes. It is superseded by
type-`map' files.
See Specifying symbolic equivalence classes of Syntactic Analysis.
- `delit'
- Specifies literals
appearing in a type-`con' file that are to be
recognized by special routines.
Each line of a type-`delit' file consists of a regular expression
(see Regular Expressions of Lexical Analysis)
optionally followed by an identifier.
The regular expression defines the literal to be recognized specially.
A
#define directive making the identifier a synonym for that
literal's syntax code is placed in the generated file `litcode.h'.
- `str'
- Specifies initial contents of the identifier table.
Each line of a type-`str' file consists of two integers and a sequence
of characters.
The first integer is the syntax code to be returned by
mkidn
(see Unique Identifier Management of Library Reference Manual), and the second is the length of the character sequence.
The integer representing the length is terminated by a single space.
The character sequence begins immediately after this space, and consists of
exactly the number of characters specified by the length.
- `gnrc'
- Defines a generic module.
Generic modules can be instantiated to yield collections of specifications
that solve specific problems.
- `fw'
- Combines a collection of strongly-coupled specifications with documentation
describing their relationships.
Eli splits these specifications according to their types and processes them
individually.
It can also create a printed document or on-line hypertext from a
type-`fw' file.
- `phi'
- Material to be included
in some part of the generated processor.
Specification files of this type should have a name consisting of three parts:
foo.bar.phi .
All the files whose names end in `bar.phi' are concatenated together
in arbitrary order to form a file named bar.h .
An #include directive can then be used to incorporate `bar.h'
into any generated file.
`.phi'-file-parts may also be generated by different Eli-Tools. They may only
be used in files of type `.h' and `.c'. They are automatically protected
against multiple inclusion.
- `eta'
- Material to be included
in some part of the specifications. Specification files of this
type should have a name consisting of three parts:
foo.bar.eta .
All the files whose names end in `bar.eta' are concatenated
together in arbitrary order to form a file named bar.eta.h . An
#include directive can then be used to incorporate
`bar.eta.h' in any specification file.
`.eta'-file-parts can be used in any specification file with the
exception of `.specs' and `.fw'-files. The generated
include-files are not protected against multiple inclusion.
Any of these files can contain C-style comments
and preprocessor directives
such as #include ,
#define
and #ifdef .
The C preprocessor is applied to files of all types except type-`fw'
before those files are examined.
C-style comments and preprocessor directives appearing in type-`fw'
files are passed unchanged to the files generated from the type-`fw'
file.
Eli includes three pre-defined header files, which are usually generated
from type-`phi' specifications, in specified places:
- `HEAD.h'
- Included at the beginning of the main program, the tree constructor,
and the attribute evaluator.
This header file is used primarily to define abstract data types used in
tree computation.
- `INIT.h'
- Included at the beginning of the main program's executable code.
`INIT.h' may contain declarations, but only if they appear at the
beginning of compound statements that lie wholly within `INIT.h'.
The content of `INIT.h' will be executed before any other code.
Its primary purpose is to initialize abstract data types used in tree
computation.
- `FINL.h'
- Included at the end of the main program's executable code.
`FINL.h' may contain declarations, but only if they appear at the
beginning of compound statements that lie wholly within `FINL.h'.
The content of `FINL.h' will be executed after all other code.
Its primary purpose is to finalize abstract data types used in tree
computation.
Eli recognizes three kinds of object:
a file, a string and a list.
Examples of files are a specification file such as those mentioned in the
previous section, an executable binary file, or an output file from a test
run.
A flag to a command is an example of a string.
Lists are ordered sequences of objects, such as the arguments to a command
or the Makefile, C files, and header files that implement a text processor.
Source objects
can be created or modified directly by the user.
They can be regular files, directories, or symbolic links.
Source objects cannot be automatically recreated by Eli;
they are the basic building blocks
from which Eli creates all other objects.
Every source object is given a type by Eli
based on its host filename,
and this type determines what derived objects can be produced from the
source object.
The file type of a source file is the longest suffix of the file name that
matches one of the source type suffixes listed in the last section.
If no suffix match is found, the file type is empty.
Derived objects
are objects that can be produced from source objects and other derived objects
through the invocation of one or more tools.
Tools are invoked only as needed to create a specified derived object.
Eli automatically caches derived objects for re-use in future
derivations.
Derived objects are created and modified only by Eli itself,
not by users.
A derived object is named by an odin-expression
(see Referring to Objects of User Interface).
Lexically, an odin-expression is composed of a sequence of
identifier
and operator
tokens, and is terminated by a newline character.
An odin-expression can be continued on multiple lines
by escaping each newline character with a backslash.
This backslash (but not the newline) is deleted
before the expression is parsed.
Multiple odin-expressions can be specified on the same line
by separating them with semicolon operators.
An identifier token is just a sequence of characters.
The following characters must be escaped to be included in an identifier:
: + = ( ) / % ; ? $ < > ! # \ ' space tab newline
A single character can be escaped by preceding it with a backslash
(e.g. lost\+found ).
A sequence of characters can be escaped
by enclosing them in single quote marks
(e.g. 'lost+found' ).
Unescaped white space
characters (spaces, tabs, and newlines)
are ignored during parsing except when they separate adjacent identifiers.
Here are a number of odin-expressions that name common objects derived from
the same collection of specifications (all of the spaces are redundant).
The identifier following a colon (:) is an object type (or
product) that characterizes the properties of the derived object,
while the identifier following a plus (+) is a parameter type
that modifies those properties without changing the nature of the derived
object.
(For the characteristics of all of the products and parameters defined by Eli,
see Top of Products and Parameters.)
sets.specs :exe
- is the executable program generated by Eli from the specifications
enumerated in
sets.specs .
It is a normal program for the machine on which it was generated,
and is independent of Eli.
sets.specs :source
- is a set of C files, a set of header files, and a Makefile.
The result of running
make with this information is the executable
program generated by Eli from the specifications
enumerated in sets.specs .
sets.specs :exe :help
- is a browser session that helps to explain inconsistencies in the
specifications enumerated in
sets.specs .
It provides cross-references to on-line documentation and allows you to
invoke an editor on the proper files to make corrections.
. +cmd=(sets.specs :exe) (input) :run
- is the result of running the program generated by Eli from the
specifications enumerated in
sets.specs as a command with the file
`input' from the current directory as an argument.
The name of the directory (in this case . , the name of the current
directory) in which the program is to be executed precedes the parameter
that defines the command to be executed.
sets.specs +monitor +arg=(input) :mon
- is an interaction with the program generated by Eli from the specifications
enumerated in
sets.specs , as it processes the data file
`input'.
This interaction allows you to follow the execution at the level of your
specifications, rather than at the level of the machine on which the
program is running.
sets.specs +debug :dbx
- is an interaction with the program generated by Eli from the specifications
enumerated in
sets.specs using the symbolic debugger of the machine
on which the program is running.
It is useful when some of your specifications are written directly in C.
(Replace dbx with gdb to use the GNU symbolic debugger.)
sets.specs :gencode :viewlist
- is an interactive shell executing in a directory containing all text files
generated by Eli from the specifications enumerated in
sets.specs .
This facility is sometimes useful in diagnosing compiler errors due to type
mismatches.
sets.specs :exe :err >
- is the raw set of reports generated by inconsistencies in the specifications
enumerated in
sets.specs , written to the screen.
(It would be sent to your editor if you replaced > with <.)
This display is sometimes useful if the reports are garbled by the
help derivation.
sets.specs :exe :warn >
- is the raw set of reports generated by anomalies in the specifications
enumerated in
sets.specs , written to the screen.
(It would be sent to your editor if you replaced > with <.)
This display is sometimes useful if the reports are garbled by the
help derivation.
Eli is invoked by giving the command eli .
If you have never used Eli before, it will have to establish a cache
(see Hints on Cache Management).
This process is signaled by a long sequence of messages about installing
packages, followed by a note that the packages have been compiled.
After printing an identifying banner, Eli writes the prompt -> and
waits for input.
The interactive session can be terminated by responding to the ->
prompt with a ^D, and you can browse the documentation by responding
with a question mark.
Entering a derived object name in response to the -> prompt
constitutes a request to bring that derived object up to date with respect
to all of the source objects on which it depends.
Eli will carry out the minimum number of operations required to satisfy
this request.
When the next -> prompt appears, the given object will be up to
date.
Bringing an object up to date does not yield a copy of that object.
To obtain a copy, you must add an output request.
The precise form and effect of an output request depends on whether the
object being output is a file object or a list object.
All source objects are file objects; to find out the kind of a derived
object, consult Top of Products and Parameters.
Here are some examples of common output requests:
sets.specs :parsable >
- requests that the derived file object
sets.specs :parsable
be written to the standard output (normally the screen).
Garbage will result if the derived object is not a text file.
sets.specs :exe > trans
- requests that the derived file object
sets.specs :exe
be written to file `trans'.
The derived object must be a file object, but it need not be text.
If the file `trans' does not exist, it will be created;
if it does exist, it will be overwritten if its content differs from that
of the derived object sets.specs :exe .
If `trans' exists and is a directory, a file named
sets.specs.exe will be written to that directory
(see Extracting and Editing Objects of User Interface).
sets.specs :source > src
- requests that the derived list object
sets.specs :source
be written to directory `src'.
The directory `src' must exist.
A file in `src' before the request will be overwritten only if it has the
same name as one of the file objects in the list sets.specs :source ,
but different content.
(Normally, `src' would be either an empty directory or
one that contains an earlier version of sets.specs :source .)
sets.con <
- requests that your current editor be invoked on the object
sets.con
While one is interacting with Eli, there are a number of situations in
which one wishes to execute normal operating system commands.
These commands can be executed with or without derived objects as
arguments.
We have already seen the most general form, a derived object that is an
execution of an arbitrary command in an arbitrary directory
(see Common Derived Objects).
Although this facility is general enough to handle any command
execution, it is cumbersome for simple commands.
The ! character introduces a host command line
(see Unix Commands of User Interface).
If the first non-white space character following the ! is not
:, ; or = then the rest of the line is treated as a
single, escaped sequence of characters.
This avoids the confusion resulting from interactions between the escape
conventions of host commands and odin-expressions.
A leading :, ;, = or whitespace can be included in the
escaped sequence by preceding it with \.
If the name of a file object precedes the ! character, that object is
brought up to date and the name of a file containing it is appended to the
host command line.
Here are examples of some typical command invocations using !:
!ls
- lists the files in the current directory.
!mkdir src
- makes a new subdirectory of the current directory.
(sets.specs :exe) ! size
- provides information about the space used by the processor generated from
sets.specs .
input +cmd=(sets.specs:exe) :stdout ! diff desired
- compares the file `desired' with the result of applying the processor
generated from
sets.specs to the file `input'.
|