Syntactic Analysis
RULE declarations in files of type `lido' describe the
structure of the abstract syntax tree over which computations are
performed. Eli will create a routine to construct an abstract syntax
tree if any tree computations
are specified
(see Tree Structure of LIDO -- Computations in Trees).
In order to do this, Eli must be able to deduce a unique correspondence
between the concrete and abstract syntaxes, such that for each rule
in the concrete syntax it is possible to uniquely determine what
abstract syntax tree fragment to build. The tool within Eli that
does this is called Maptool. In addition to generating a routine to
construct the abstract syntax tree, Maptool will also deduce complete
versions of the concrete and abstract syntaxes if only incomplete versions
of each are provided by the user. This can only be done if the two
syntaxes can together form a complete syntax.
The concrete syntax is provided by the user in files of type `con'.
Since EBNF constructs are allowed in these files, they are first translated
into their strict BNF equivalents before being processed by Maptool
(see Using extended BNF to describe more complex rules). The abstract syntax is extracted from the RULE
declarations made in files of type `lido'
(see Rule Specifications of LIDO - Reference Manual).
The remainder of this section will discuss how Maptool deduces the
correspondence between the two syntaxes, the use of files of type `map'
to influence the mapping process, and some usage hints.
Maptool begins by matching any LISTOF constructs that appear
in the abstract syntax to any appropriate concrete rules. The next
phase examines each concrete rule not matched in the previous phase
and tries to find a matching abstract syntax rule. After all matching
is complete, unmatched concrete rules are added to the abstract syntax
and unmatched abstract rules are added to the concrete syntax. There
are a few exceptions to this as are noted in the remainder of this
section.
While the most obvious benefit to having Maptool deduce syntax fragments
from one syntax and place them in the other is to reduce the amount
of typing required, the more important advantage is the support it
gives for incremental development. It allows the user to only specify
those portions of the syntax with which they are concerned at the moment.
Chain rules have different behavior than other rules during the matching
process. Descriptions for three different kinds of chain rules are given here
to assist in the explanations given in the remainder of this section:
- Chain Rule
- A normal chain rule is a rule in which there is exactly one symbol on
the right hand side of the rule that is not equivalent to the left
hand side. For example, `X ::= Y' where X is not equivalent to Y
is a chain rule.
- Trivial Chain Rule
- A trivial chain rule is a chain rule in which the left hand side is
equivalent to the right hand side. This typically happens when a symbolic
equivalence class is defined that includes both the left hand side symbol
and the right hand side symbol (see Specifying symbolic equivalence classes).
- Literal Chain Rule
- A literal chain rule is similar to a trivial chain rule, except that it
also has literal symbols on its right hand side. A typical example of
this is the rule `Expr ::= '(' Expr ')''.
Based on the above definition for normal chain rules, we define coercions
between symbols. A symbol X can be coerced to a symbol Y if
there is a chain rule with X on the right hand side and Y on
the left hand side. Coercions are also transitive. If X is coercible
to Y and Y is coercible to Z , then X is also
coercible to Z . A symbol is also considered coercible to itself.
The LISTOF construct denotes zero or more occurrences of the
elements that appear on its right hand side. It does not dictate
the ordering of those right hand side symbols or any delimiters that
may be used to separate them. The ordering and delimiters are determined
by concrete rules. In simple terms, Maptool begins with the left hand
side of the LISTOF and recursively matches rules until it finds
the right hand side elements. The next paragraph gives a more precise
description.
An abstract LISTOF construct is matched by starting with the
symbol on the left hand side of the LISTOF. All concrete rules with
equivalent left hand side symbols are added to the set of matched rules.
For each rule added to the set, the right hand side symbols are examined.
Of these symbols, literal symbols are ignored. If terminal symbols are
encountered that aren't coercible to the symbols appearing on the right
hand side of the LISTOF , an error is signaled, because the left
hand side of the LISTOF may not derive symbols other than those that
appear on the right hand side. For each nonterminal symbol that isn't
coercible to one of the right hand side symbols, the concrete rules that
have that symbol on their left hand side are added to the set. The process
continues until no more rules can be added to the set.
The intermediate nonterminal symbols that are encountered as new concrete
rules are added to the set may not appear on the right hand side of other
concrete rules.
If Maptool doesn't find any concrete rules to match a LISTOF , it
will generate a canonical left recursive representation. For the
list:
RULE: Program LISTOF Declaration | Statement END;
Maptool would generate the following:
Program: LST_Program .
LST_Program: LST_Program Declaration .
LST_Program: LST_Program Statement .
LST_Program: .
This specifies zero or more occurrences of Declaration 's and
Statement 's.
There is one other important thing to note about the LISTOF construct.
Attribute computations associated with a LISTOF construct can
just as easily be written as symbol computations on the symbols of the
LISTOF . The advantage to using the LISTOF construct is
that it becomes possible to generate an abstract syntax tree structure
which allows for more efficient traversal. In order to construct this
special tree structure, it is sometimes necessary to insert an additional
chain rule into the concrete syntax at the root of the LISTOF .
This is the case when the rules matching the LISTOF have a recursive
occurrence of the left hand side symbol. As an example, the LISTOF
construct shown above might be written as follows in the concrete syntax:
Program: Program Declaration .
Program: Program Statement .
Program: .
As you can see, the root of the LISTOF , Program is used both
on the left hand side and right hand side of rules that match the LISTOF
construct, meaning that it is used recursively. If the LISTOF
construct is provided in a `.lido' file, Maptool must introduce the
chain rule `Program ::= LST_Program' and change other occurrences of
Program to LST_Program in order to build the efficient tree
structure.
Users should be aware that it is possible for the addition of this chain
rule to cause LALR(1) conflicts for the parsability of the concrete syntax
that do not appear in the absence of the LISTOF construct.
In these cases, users must either rewrite the concrete syntax or
avoid the use of the LISTOF construct to avoid the problem.
After all LISTOF constructs have been matched, Maptool attempts
to match the remaining concrete rules to rules given in the abstract
syntax. A match is determined if the signature of the concrete rule
is equivalent to the signature of an abstract rule or coercions
(see Chain rule definitions) exist between any symbols which differ in the
signatures. Remember that symbolic equivalence classes are applied
to concrete rules before this matching takes place, so symbols in the
signatures are considered equivalent if they belong to the same equivalence
class.
For example, consider the following abstract rules:
RULE: Declaration ::= IdDef Type END;
RULE: IdDef ::= Identifier END;
The following concrete rule will match the first of the above abstract
rules, because of the coercion defined between Identifier and
IdDef :
Declaration: Identifier Type .
The reason for doing this is to distinguish semantically between
occurrences of Identifier 's in different contexts. In the above
example, we have used IdDef to represent a definition of an
Identifier . In another place in the grammar, we may want to
refer to uses of identifiers instead and use the symbol IdUse .
Note that use of chain rules in the manner just described makes it
impossible to perform attribute computations during tree construction
(see Constraints on grammar mapping).
It is possible for Maptool to detect multiple possible matching abstract
rules for a single concrete rule. Maptool signals an error in this case
that must be fixed by changing the grammar to disambiguate the contexts.
After rule matching is complete, unmatched concrete rules, except
trivial chain rules and literal chain rules (see Chain rule definitions)
are added to the abstract syntax. The reason for this is that
trivial chain rules are meaningless in the abstract syntax and literal
chain rules are only meaningful if they have attribute computations
associated with them, in which case they would already have been
specified as part of the abstract syntax.
Sometimes it is desirable to include literal chain rules in the abstract
syntax even when the user has not explicitly included them there. A
typical situation where this occurs is when generating output conforming
to the concrete syntax using the Idem tool
(see Textual unparser of Abstract Syntax Tree Unparsing).
In this situation the output must contain all
literals hence the literal chain rules must be in the abstract syntax so
that Idem can generate output patterns for them. To preserve the
literal chain rules in the abstract syntax use the MAPCHAINS
keyword in a specification (see Preserving literal chain rules).
Unmatched abstract rules are included in the concrete syntax except in
the following instances:
-
The rule is a chain rule whose left hand side is not a symbol in
the concrete syntax. Adding the rule to the concrete syntax in this
case would cause the concrete syntax to be disconnected.
-
The rule can only be part of a computed subtree
(see Computed Subtrees of LIDO - Reference Manual).
This is true if the rule is only reachable from the root symbol
if symbols preceded by a $ are included.
Users can use the :consyntax product
(see consyntax of Products and Parameters) to view the
complete version of the concrete syntax.
The :abstree product
(see abstree of Products and Parameters) is used to view
the complete abstract tree grammar. The :absyntax product
(see absyntax of Products and Parameters) by contrast only
shows the abstract syntax rules which are not part of computed subtrees.
Files of type `map' can be provided by the user to influence the
way in which certain rules are matched. The syntax of map files
can be found with other grammar description towards the end of this
document (see Grammars for the Specification Files).
There are currently three ways in which the mapping can be affected.
The first are symbolic equivalence classes, which group together
symbols that have the same semantic meaning. The second method is to
map specific rules. Using this method, concrete rules can be rewritten
and/or reordered to match a specific abstract rule. The third method
controls the elimination of literal chain rules.
Symbolic equivalence classes are used to group together symbols
appearing in the concrete syntax because the semantics of the symbols
are equivalent. As a result, a single symbol can be used to represent
all of the members of the symbolic equivalence class in the abstract
syntax. This representative symbol can either be one of the concrete
symbols or a new symbol altogether. Symbolic equivalence classes are
specified in files of type `map'. A series of symbolic
equivalences must be preceded by the keyword MAPSYM . An
equivalence class is then specified by giving the representative symbol
(the symbol to appear in the abstract syntax), followed by ::= and
the list of symbolically equivalent symbols from the concrete syntax
terminated by a period. For example, the following specification says
that a Primary , Factor , and Expr belong to the same
equivalence class:
MAPSYM
Expr ::= Primary Factor .
Application of symbolic equivalence classes to rules in the concrete syntax
is done before the matching process begins. Symbolic equivalence classes
can only be created for symbols which are either all nonterminals or
all terminals (see How to describe a context-free grammar). An error message will also be issued
if a symbolic equivalence class specification includes abstract syntax
symbols on the right hand side, since each abstract syntax symbol
represents its own equivalence class.
For backward compatibility with previous releases of Eli, symbolic
equivalence classes may also be specified in files of type `sym'.
Rule mapping allows users to rewrite a concrete rule for the purposes
of matching it to a specific abstract rule. This is useful in cases
where two syntactically different constructs are semantically equivalent.
Consider the following expression language with bound identifiers:
Computation: LetExpr / WhereExpr .
LetExpr: 'let' Definitions 'in' Expr .
WhereExpr: Expr 'where' Definitions .
In this example, LetExpr and WhereExpr are semantically
equivalent constructs, but the ordering of Definitions and Expr
are reversed and they use different literal symbols.
We'd like to only specify the semantic computations for the two constructs
once. To do this, we can define a symbolic equivalence class for
LetExpr and WhereExpr :
MAPSYM
BoundExpr ::= LetExpr WhereExpr .
The abstract rule that we can use to represent the two constructs is:
RULE: BoundExpr ::= Definitions Expr END;
Finally, we must use rule mapping specifications to rewrite the two
concrete rules to match the abstract rule:
MAPRULE
LetExpr: 'let' Definitions 'in' Expr < $1 $2 > .
WhereExpr: Expr 'where' Definitions < $2 $1 > .
The keyword MAPRULE precedes a group of rule mapping specifications
in the map file.
The concrete rule to be rewritten is specified exactly as it appears in the
concrete syntax, and is followed by a description of the right-hand side of
the abstract rule enclosed in angle brackets.
Any literal symbols in the abstract rule are represented in the description
by themselves.
Literal symbols appearing in the abstract rule need not have counterparts
in the concrete rule, and literal symbols in the concrete rule need not
appear in the abstract rule.
Nonliteral symbols (both nonterminals and nonliteral terminals) in the
right-hand side of the abstract rule must have counterparts in the
right-hand side of the concrete rule.
Each nonliteral symbol in the right-hand side of the abstract rule is
represented by a `$' followed by the index of its counterpart in the
right-hand side of the concrete rule.
There is no requirement that all nonliteral symbols from the concrete rule
appear in the abstract rule, and no requirement on the order of appearance,
but duplicates are not allowed.
Here is another example of a concrete rule:
Exp: Lambda Idn '.' Exp .
This rule has three nonliteral symbols: Lambda , Idn , and
Exp .
Lambda and Idn are terminals; Exp is a nonterminal.
Suppose that our abstract syntax has the following rules:
RULE: IdDef ::= Idn END;
RULE: Exp ::= IdDef 'is bound in' Exp END;
The first abstract rule tells us that an Idn may be interpreted as
an IdDef in certain contexts (such as that represented by the second
abstract rule).
We can map the concrete rule given above into the second abstract rule,
since the second abstract rule expresses the semantics of the concrete rule:
MAPRULE
Exp: Lambda Idn '.' Exp < $2 'is bound in' $3 > .
Here the concrete rule is given, and the right-hand side of the abstract
rule is described in the angle brackets.
The nonliteral terminal symbol Lambda is dropped (positional
parameter $1 does not appear in the angle brackets), and the literal
terminal symbol '.' is replaced by 'is bound in' .
An abstract syntax will sometimes have several rules with different names
but identical signatures.
For example, consider the case where dyadic expressions are represented by
abstract rules that do not contain operators:
RULE Add: Expression ::= Expression Expression END;
RULE Mul: Expression ::= Expression Expression END;
...
In this case, the rule mapping must specify the abstract rule name
explicitly in order to disambiguate the pattern match:
MAPRULE
Expression: '(' Expression '+' Expression ')' < $1 $2 >: Add .
Expression: '(' Expression '*' Expression ')' < $1 $2 >: Mul .
...
Rule names are optional, and may be omitted when the pattern match is
unambiguous (as in the earlier examples).
When rule matching proceeds, the concrete rule is seen in its rewritten
form. An abstract syntax rule must exist in a LIDO
specification that corresponds to the rule mapping specification given.
Note that the use of rule mapping makes it impossible to perform
attribute computations during tree construction (see Constraints on grammar mapping).
The mapping process normally does not include literal chain rules in the
complete abstract syntax unless they appear in the user-supplied
abstract syntax (see Complete generated concrete and abstract syntaxes). Sometimes it is desirable to
preserve literal chain rules even if the user has not included them in
the abstract syntax. To force literal chain rules to be included in the
abstract syntax, use the MAPCHAINS keyword. The behavior is
unchanged if all literal chain rules already appear in the abstract syntax.
Care should be taken when using MAPCHAINS in conjunction with
attribution. A specification using this keyword may require
more attribution than the same specification without it, because it
may be necessary to transfer attribute values from the child to the parent
or vice versa. The presence of symbol computations for
the symbols occurring in the chain rules without the transfer computations
just mentioned may result in incorrect attribution without warning.
The generation of the parsing grammar (the input to the parser) may be
influenced by BOTTOMUP specifications
(see Computations of LIDO - Reference Manual)
specified in your attribute grammar. This is because the parsing grammar
must ensure that the nodes of the abstract syntax tree are constructed
in a particular order in the presence of BOTTOMUP constraints.
In order to deal with this, Maptool must sometimes inject generated
chain rules into the parsing grammar to which tree building actions can
be attached. These injected chain rules may cause the parsing grammar
to exhibit LALR(1) conflicts. If so, an error will be reported to
indicate that the BOTTOMUP constraints you have provided cause
your grammar to not be parsable.
In trying to resolve such a conflict, it is useful to use the :pgram
derivation (see pgram of Products and Parameters) to be able to
view the parsing grammar that is submitted to the parser generator and
contains the injected chain rules. It is also useful to use the
:OrdInfo derivation to get more information about how
BOTTOMUP constraints were introduced for specific rules.
Approaches to resolving such a problem include eliminating unnecessary
BOTTOMUP constraints from the attribute grammar or making changes
to the concrete syntax that allow the chain rules to be injected without
causing LALR(1) conflicts.
This section begins by describing typical patterns of syntax development.
This is followed by two more specific examples of how to use the mapping
techniques described in the previous sections.
When developing a translator for an existing language, the complete
concrete syntax is typically already available. In these cases,
it is advantageous to start with the complete concrete syntax and
add symbolic equivalences and rule mapping specifications to suit
the attribute computations as they are being developed.
On the other hand, when designing a new language, it is easier to start
work by specifying attribute computations and adding concrete syntax rules
as necessary to resolve issues of precedence, associativity, and other
parsing ambiguities.
When errors relating to the syntax appear, it is strongly recommended that
the first course of action be to look at the complete generated versions of the
syntaxes by using the :consyntax , :absyntax , and
:abstree products (see Specifications of Products and Parameters).
Very often these problems are simply a result
of not correctly anticipating the matching process.
The LIGA attribute grammar system allows users to specify
that the first pass of computations are to be performed as the
abstract syntax tree is being built. This is specified either
by an option given in a LIGA control specification
see Order Options of LIGA Control Language or by using
an additional keyword in an attribute grammar computation
see Computations of LIDO - Reference Manual.
Combining computations with tree construction, however, requires
that the tree be constructed in strict left-to-right and
bottom-to-top order. In the presence of more advanced grammar
mappings, it is not possible to maintain this strict ordering.
For this reason, Maptool generates the LIGA control directive:
ORDER: TREE COMPLETE ;
when it detects that one of these grammar mappings is required.
The control directive indicates that the tree
should be constructed completely before any computations
take place.
The grammar mappings which cause Maptool to emit these directives
are the use of chain rules in the abstract syntax that do
not exist in the concrete syntax (see Matching remaining rules)
and any use of rule mapping (see Specifying rule mappings).
Aside from symbolic mappings (see Specifying symbolic equivalence classes) and the
use of LISTOF constructs,
the generated concrete and abstract syntaxes need to be identical in order
to allow computations to take place during tree construction.
Literal terminals often distinguish phrases whose structures are identical
except for the particular literal terminal.
For example, in a normal arithmetic expression the phrase describing
addition and the phrase describing subtraction are identical except for
the literal + or - .
Taking nonterminal equivalence classes into account, it may be that
all phrases representing operations with two operands are identical
except for the operator literal.
When phrases have identical structure except for one or more literals,
the tree computations carried out at the nodes corresponding to those
phrases are often identical except for some parameter that depends on the
particular literal.
It is then useful to abstract from the distinct literals,
obtaining a single phrase with which to associate the computation and
a set of phrases with which to associate the parameter evaluation.
The key point here is that in many cases the computation will apply to a
wide variety of translation problems, whereas the particular set of
literals characterizes a single translation problem.
By abstracting from the distinct literals, the computation can be reused.
To abstract from a specific literal, simply replace that literal with a
nonterminal and add a production that derives the literal from that
nonterminal.
This added production represents the phrase with which the parameter
evaluation would be associated.
The computation for the phrase in which the literal was replaced by
the nonterminal will now obtain the parameter value from the corresponding
child, rather than evaluating it locally.
It is quite common for a single operator to have different meanings that
depend on the types of its operands.
For example, in Pascal the operator + might mean integer addition,
real addition or set union.
There are well-known techniques for deciding what is meant in a particular
context, and these techniques depend only on the particular set of operators
and operand types.
The computations themselves are parameterized by this information
(see Selecting an operator at an expression node of Type Analysis).
In order to reuse the tree computation to resolve overloading,
abstract from the particular set of literals
that represent the operators of the language.
Then define equivalence classes in which every nonterminal representing an
expression is replaced by Expr and every nonterminal representing an
operator by Op .
Finally, associate the appropriate computations with the following rules:
Expr: Expr Op Expr.
Expr: Op Expr.
Expr: Identifier.
Expr: Integer.
...
(Here ... indicates rules for other denotations, such as
floating-point numbers, true , etc., defined in the language.)
As an example of the process, consider a language with integer and Boolean
expressions in the style of Pascal.
The literals that represent operators in this language are + ,
- , * , / , div , mod , and ,
or and not .
Define a new nonterminal for each precedence level of the dyadic operators,
one for the unary arithmetic operators, and one for not :
Addop: '+' / '-' / 'or' .
Mulop: '*' / '/' / 'div' / 'mod' / 'and' .
Sign: '+' / '-' .
Notop: 'not' .
These productions abstract from the literals, and embody the information
about the precedence and association (all operators are left-associative)
needed to determine the phrase structure.
Using these new nonterminals, define the phrase structure of an expression:
SimpleExpression: Sign Sum / Sum .
Sum: Sum Addop Term / Term .
Term: Term Mulop Factor / Factor .
Factor: Notop Factor / Primary .
Primary: Integer / Id / 'true' / 'false' / '(' SimpleExpression ')' .
(Here Integer is a terminal representing arbitrary digit sequences
and Id is a terminal representing arbitrary identifiers.
These symbols will be recognized by the lexical analyzer.)
All of the dyadic operators fall into the same equivalence class, which
should be represented by the symbol Binop .
Sign and Notop both belong to the Unop class, and
SimpleExpression , Sum , Term , Factor ,
Primary are in the Expr class.
Here is a type-`map' file defining these classes:
MAPSYM
Op ::= Addop Mulop Sign Notop .
Expr ::= SimpleExpression Sum Term Factor Primary .
|