General Information
Tutorials
Reference Manuals
Libraries
Translation Tasks
Tools
Administration
|
Syntactic AnalysisContext-Free Grammars and ParsingA context-free grammar is a formal system that describes a language by specifying how any legal text can be derived from a distinguished symbol called the axiom, or sentence symbol. It consists of a set of productions, each of which states that a given symbol can be replaced by a given sequence of symbols. To derive a legal text, the grammar is used as data for the following algorithm:
When this algorithm terminates, Given a context-free grammar that satisfies certain conditions, Eli can generate a parsing routine to determine the derivation (and hence the phrase structure) of any legal text. This routine will also automatically detect and report any errors in the text, and repair them to produce a correct phrase structure (which may not be that intended by the person who wrote the erroneous text).
How to describe a context-free grammarEach production of a context-free grammar consists of a symbol to be replaced and the sequence that replaces it. This can be represented in a type-`con' file by giving the symbol to be replaced, followed by a colon, followed by the sequence that replaces it, followed by a period:
Assignment: Variable ':=' Expression. StatementList: . Statement: 'if' Expression 'then' Statement 'else' Statement.
The first production asserts that the symbol
Representations of symbols to be replacedSymbols that are to be replaced are called nonterminals, and are always represented by identifiers. (An identifier is a sequence of letters and digits, the first of which is a letter.) Every nonterminal must appear before a colon in at least one production. The axiom is a nonterminal that appears before the colon in exactly one production, and does not appear between the colon and the period in any production. There must be exactly one nonterminal satisfying the conditions for the axiom.
Representations of character stringsSymbols that cannot be replaced are called terminals, and may be represented by either identifiers or literals. (A literal is a sequence of characters bounded by apostrophes ('). An apostrophe appearing within a literal is represented by two successive apostrophes.) No terminal may appear before a colon in any production. Terminals represent character strings that are recognized by the lexical analyzer (see Specifications of Lexical Analysis).
Using extended BNF to describe more complex rulesExtended BNF allows the use of certain operators on the right hand side of a production. These operators are designed to be short-hands to simplify the grammar description. Rules with extended BNF operators can be translated into rules which use only the strict BNF constructs described so far. While the use of extended BNF constructs is supported for the concrete syntax description in Eli, only strict BNF constructs are allowed in the abstract syntax. When it comes time to deduce the correspondence between the concrete and abstract syntax, Maptool operates on the abstract syntax and a version of the concrete syntax in which all rules containing extended BNF constructs have been translated into equivalent strict BNF rules. The remainder of this section is devoted to describing how each of the extended BNF constructs are translated to their strict BNF equivalents. Note that most of the EBNF constructs require the introduction of generated symbols for their strict BNF translation. Users are strongly discouraged from using these constructs in instances where attribution is required for those contexts, because changes in the grammar will change the names of the generated symbols used.
The most appropriate use of EBNF constructs that introduce generated
symbols is when matching the LIDO
Collecting replacements for the same symbolWhen a grammar contains many productions specifying replacement of the same nonterminal, a slash, denoting alternation can be used to avoid re-writing the symbol being replaced:
Statement: Variable ':=' Expression / 'if' Expression 'then' Statement 'else' Statement / 'while' Expression 'do' Statement .
This alternation specifies three productions.
The nonterminal to be replaced is
Statement: Variable ':=' Expression . Statement: 'if' Expression 'then' Statement 'else' Statement . Statement: 'while' Expression 'do' Statement . Alternation does not introduce any generated symbols and has a very straight-forward translation. As a result, it is the most heavily used of the EBNF constructs.
Describing optional symbols within a rule
Square brackets are used to denote that the set of symbols
enclosed by the brackets are optional. In the following
example,
Program: [Constants] [Variables] Body . The strict BNF translation of this construct is to generate a rule for each possible permutation of the right hand side. In the case of the above example, the following four rules would result:
Program: Body . Program: Variables Body . Program: Constants Body . Program: Constants Variables Body . While the translation doesn't introduce any generated symbols, indiscriminate use of this construct may lead to less readable specifications.
Specifying zero or more occurrences
An asterisk (or star) is used to denote zero or more occurrences
of the phrase to which it is applied. In the following example,
Program: Variable* Body .
The strict BNF translation of this construct requires the introduction
of a generated symbol. Generated symbols begin with the letter
Program: G1 Body . G1: G1 Variable . G1: .
Specifying one or more occurrences
A plus is used to denote one or more occurrences
of the phrase to which it is applied. In the following example,
Program: Variable+ Body . The strict BNF translation of this construct is similar to the translation of the asterisk (see Specifying zero or more occurrences). The translation for the above example is as follows:
Program: G1 Body . G1: G1 Variable . G1: Variable .
EBNF Separator construct
A double slash is used to denote one or more occurrences of a phrase
separated by a symbol. In the following example,
Input: Declaration // ',' . The strict BNF translation for the above example is as follows:
Input: G1 . G1: G2 . G1: G1 ',' G2 . G2: Declaration . Note that all of the EBNF constructs, except the single slash (for alternation) have higher precedence than the separator construct.
Grouping constructs in EBNF expressionsParentheses are used to group EBNF constructs. This is used primarily to apply other EBNF operators to more than a single symbol. For example:
Program: (Definition Use)+ .
In this example, we want to apply the Plus operator to the concatenation of
a
Program: G2 . G1: Definition Use . G2: G1 . G2: G2 G1 . This is identical to the translation for the Plus operator operating on a single symbol, except that another generated symbol is created to represent the parenthetical phrase. Note that a common error is to introduce parentheses where they are not needed. This will result in the introduction of unexpected generated symbols.
Using structure to convey meaning
A production is a construct with two components: the symbol to be replaced
and the sequence that replaces it.
We defined the meaning of the production in terms of those components,
saying that whenever the symbol was found in The context-free grammar for a language specifies a "component" relationship. Each production says that the components of the phrase represented by the symbol to be replaced are the elements of the sequence that replaces it. To be useful, the context-free grammar for a language should embody exactly the relationship that we use in defining the meanings of the constructs of that language.
Operator precedenceConsider the following expressions:
A + B * C (A + B) * C
In the first expression, the operands of the addition are the variable
The general method for embodying this concept of operator precedence in a context-free grammar for expressions is to associate a distinct nonterminal with each precedence level, and one with operands that do not contain "visible" operators. For our expressions, this requires three nonterminals:
The productions that embody the concept of operator precedence would then be:
Sum: Sum '+' Term / Term. Term: Term '*' Primary / Primary. Primary: '(' Sum ')' / Identifier.
Operator associativityConsider the following expressions:
A - B - C A ** B ** C A < B < C
Which operator has variable
This question can be answered by stating an association
for each operator:
If Association rules are embodied in a context-free grammar by selecting appropriate nonterminals to describe the operands of an operator. For each operator, two nonterminals must be known: the nonterminal describing expressions that may contain that operator, and the nonterminal describing expressions that do not contain that operator but may be operands of that operator. Usually these nonterminals have been established to describe operator precedence. Here is a typical set of nonterminals used to describe expressions:
The association rules discussed above would therefore be expressed by the following productions (these are not the only productions in the grammar):
Sum: Sum '-' Term. Factor: Primary '**' Factor. Relation: Sum '<' Sum.
The first production says that the left operand of
Scope rules for declarationsIdentifiers are normally given meaning by declarations. The meaning given to an identifier by a particular declaration holds over some portion of the program, called the scope of that declaration. A context-free grammar for a language should define a phrase structure that is consistent with the scope rules of that language.
For example, the declaration of a procedure Now consider the following productions describing a procedure declaration:
procedure_declaration: 'procedure' procedure_heading procedure_body. procedure_heading: ProcIdDef formal_parameter_part ';' specification_part.
Notice that the phrase structure induced by these productions is
inconsistent with the postulated scope rules.
The declaration of
procedure_declaration: 'procedure' ProcIdDef ProcRange. ProcRange: formal_parameter_part ';' specification_part procedure_body.
Here the formal parameters and the body have both been made components of a
single phrase (
|