Exactly 1800 Words on Languages and Parsing
Terence Parr
Excerpt from:
Language Translation
Using PCCTS & C++
by Terence J. Parr
Published by Automata Publishing Company
Publication date: January 1997
ISBN: 0962748854
We give only a taste of language theory here and in a
very loose fashion. However, it should give you enough
information and define enough terms to get you through
the rest of the book.
In the Spring of 1983, as first year computer science
students at Purdue University, we were assigned the
problem of recognizing arithmetic expressions, which
could include nested parentheses. We were given a
specification that described what the expressions looked
like and were asked to produce a Pascal program that
recognized such expressions. The specification looked
something like
expr -> factor
factor-> term ( "+" term )*
term -> atom ( "*" atom )*
atom -> "(" expr ")"
atom -> INTEGER
atom -> IDENTIFIER
where INTEGER and IDENTIFIER were shorthands for
strings of digits and strings of letters, respectively;
"( "+" term )*" indicated that zero
or more "+" term sequences could be seen. [Or
did we get term and factor mixed up?]. The rules
described the structure of small pieces of the expression
language. For example,
term -> atom ( "*" atom )*
was read by saying, "a term is an atom followed
by zero or more ""*" atom"
sequences." We remember thinking what a marvelously
precise means of describing an infinitely large set of
input strings.
We decided that our program would have one function to
recognize each rule in the grammar to keep things nice
and neat. In this manner, references to rules would
become, possibly recursive, procedure calls in our
program. References to actual input strings such as
"(" and INTEGER were all hard coded to eat
white space and look for the particular string. This got
to be repetitive and so we decided to factor out the
common operations among all input string matching code.
Further, it seemed easier to treat input strings as
single "words" when trying to match the
grammatical structure of the expressions. We eventually
came up with a function called getword that returned an
integer describing what input vocabulary word was found.
It also made sense to assume that some variable would
always hold the next word to be matched; that is, after
matching a word, the variable would be set to the result
of calling getword again.
With the benefit of our current knowledge, now we
would say that we were provided with a term definitiongrammar
consisting of a set of term definitionrules that
specified the set of all possible sentences in the
expression term definitionlanguage; that is, the
grammar specified the Syntax, term definitionsyntax
of the language. Rules with more than one alternative
were considered to have multiple Production, term
definitionproductions. Our program was a parser
that made calls to a analyzerlexical analyzer or
scanner (our getword function) that broke up the input
character stream into term definitionvocabulary
symbols, or tokens. The program we built was a
classic example of a recursive-descent parser. A
generic term for this type of parsing is top-down
because when you look at the parse tree, the parse starts
at the top (the start symbol) and works its way down the
tree. Recursive-descent parsers are a set of mutually
recursive procedures that normally use a single symbol of
lookahead to make parsing decisions. For example,
rule atom could be encoded in C as
int atom()
{
// use lookahead to decide which alternative applies
switch ( current_token ) {
case LPAREN : // -> "(" expr ")"
current_token = getword();
expr();
if ( current_token != RPAREN ) error-clause;
current_token = getword();
break;
case INTEGER : // -> INTEGER
current_token = getword();
break;
case IDENTIFIER : // -> IDENTIFIER
current_token = getword();
break;
default :
error-clause (missing LPAREN, INTEGER, or IDENTIFIER)
}
}
where the variable current_token is the lookahead
symbol.
LL(1)Parsers that follow this simple formula can be
classified as LL(1), which is a shorthand indicating that
the input is matched from left-to-right (as opposed to
backwards) and that parsing decisions are made on the
left edge of alternative productions with 1 symbol of
lookahead. This amounts to saying that LL(1) parsers must
predict which alternative production will be successfully
matched by examining only the set of tokens that can be
matched first by each production. We loosely define this
set of tokens that predicts alternatives to be the lookahead
set. Normally, this is set of tokens that can be
matched first by a production p and is called FIRST(p);
e.g.,
FIRST("(" expr ")")
is the singleton set {"("}. Occasionally,
the FOLLOW set is used to predict alternatives.
FOLLOW(r) is the set of all tokens that can be matched
following references to rule r. For example, given the
grammar
rule -> optional_ID SEMICOLON
optional_ID -> IDENTIFIER
optional_ID ->
FOLLOW(optional_ID) is { SEMICOLON } because SEMICOLON
follows the reference to optional_ID. The lookahead set
for the empty production is defined to be the FOLLOW of
the invoking rule. Therefore, SEMICOLON predicts the
empty production of optional_ID.
When the lookahead sets from alternative productions
are not disjoint, we say that the parsing decision is nondeterministic
or ambiguous. In other words, there is at least
one token that predicts more than one alternative. Most
of the time, this is a bad thing.
LL(1) parsers may be generalized to LL(k) for k>1.
For example, the following grammar is ambiguous upon
token A:
a -> A B C
a -> A D E
where A, B, C, D, and E are some vocabulary tokens.
Because both productions have a common prefix of A, an
LL(1) parser could not determine which production was
going to successfully match. However, if the parser could
see ahead to both the A and what followed A on the input
stream, the parser could determine which production was
going to match. An LL(2) parser is such a creature;
hence, rule a is unambiguous in the LL(2) sense. A
grammar for which a deterministic LL(k) parser can be
built is LL(k). A language for which an LL(k) grammar
exists is LL(k).
Because recursive-descent parsers are just piles of
code, more sophisticated predictions can be made than
simple lookahead buffer comparisons. For example, what if
two productions are exactly alike syntactically, but are semantically
different? What if the productions have different
meanings (usually depending upon context or other
information)? Consider the following grammar:
element -> ID "(" expression_list ")"
// array reference
element -> ID "(" expression_list ")"
// procedure call
It is perfectly reasonable to separate these two cases
because while they look the same, array references and
procedure calls are very different semantically. The
definition of the ID must be consulted to determine which
production to match. In a hand-built parser, you could do
this:
element()
{
if ( current_token == ID && isarray(current_text) ) {
match an array reference;
}
else if ( current_token == ID && isprocedure(current_text) ) {
match a procedure call;
}
else error;
}
where isarray(current_text) is some function you have
defined that returns true if the ID is previously defined
as an array; isprocedure would be defined similarly. We
call such a parser a predicated LL(k) parser, pred-LL(k),
because at least one parsing decision was predicated upon
information not available to a pure LL(k) parser. The
forms isarray(current_text) and isprocedure(current_text)
are considered "semantic predicates. We could
modify the grammar as follows:
element -> <<isarray(current_text)>>? ID "("expression_list ")"
element -> <<isprocedure(current_text)>>? ID "("expression_list ")"
where <<...>>? is a semantic predicate in
ANTLR 1.xx notation.
Pred-LL(k) parsing covers another type of predicated
parsing decision. Consider the following grammar:
a -> (A)* B
a -> (A)* C
No matter how large we make the k of LL(k), a sequence
of k+1 A's could always be presented to the parser, and
the parser could not see past the A's to the B or C. This
grammar then is non-LL(k) for any fixed value of k.
Because there are real grammars that might have
constructs requiring arbitrary lookahead, we introduced
another type of predicate called Syntactic predicatesyntactic
predicates. A syntactic predicate specifies a grammar
fragment that uniquely predicts the associated
production. The grammar could be modified as follows:
a -> ( (A)* B )? (A)* B
a -> (A)* C
which indicates that, to predict the first production,
zero or more A's must be seen followed by a B. If this
syntactic predicate fails, the second production will be
attempted by default; hence, no predicate is required at
its left edge. Clearly, this ability to scan arbitrarily
ahead, renders the class of pred-LL(k) languages much
larger than the class of LL(k) languages.
Bottom-up Parsers
And now, for something completely different: a bit
about the class of languages and parsers called LR(k).
LR(k) parsers are considered bottom-up parsers
because they try to match the leaves of the parse tree as
they work their way up the parse tree towards the start
symbol at the root. A simple way to illustrate LR parsing
is to consider a simple language as described the
following grammar.
a -> A B C
a -> A B D
Loosely speaking, an LR-based parser consumes input
symbols until it finds a viable complete production; for
the purposes of this discussion, all productions are
viable. Input token A would be tested against both
productions. Since A matches neither completely, another
input token would be consumed, and AB would be compared
against the productions. Again, a complete
right-hand-side would not be matched. The next input
symbol would be consumed, say token D. At this point, ABD
matches the second right-hand-side and the parser would
report that it had found input for rule a. The process of
consuming input is called Shiftingshifting, and
the process of matching complete right-hand-sides is
called Reducingreducing. (The right-hand-side is
reduced to the left-hand-side.) In our example, no
lookahead is required to determine that a valid sentence
was found because the entire production can be seen
before making a decision. Therefore, this grammar is
LR(0).
LR(k) recognizers (and their variants such as LALR(k))
are stronger than LL(k) recognizers because the LR
strategy uses more context information. For an LR parser,
the context consists of all grammar productions
consistent with the previously seen input. This context
often includes several "pending" grammar
productions. Intuitively, an LR(k) parser attempts to
match multiple productions at the same time and postpones
making a decision until sufficient input has been seen.
In contrast, the context for an LL parser is restricted
to the sequence of previously matched productions and the
position within the current grammar production being
matched. An LL(k) parser must make decisions about which
production to match without having seen any portion of
the pending productions - it has access to less context
information. Hence, LL(k) parsers rely heavily on
lookahead. We note that our LR(0) grammar is LL(3) as a
case in point.
On the other hand, our pred-LL(k) parsers are stronger
than LR(k) parsers for two reasons. First, semantic
predicates may be used to parse context sensitive
languages. Second, pred-LL(k) parsers have access to
arbitrary lookahead. Further, embedding actions in an LR
grammar can introduce ambiguities, thus reducing the
strength of LR.