Home | Download | ANTLRWorks | Wiki | About ANTLR | Feedback | Support | Bugs | v2


Latest version is 3.0.1
Download now! »

Download
» Home
» Download
» ANTLRWorks
» News
»Using ANTLR
» Documentation
» FAQ
» Articles
» Grammars
» File Sharing
» Runtime API
» Tech Support
» Bug Tracking
»About ANTLR
» What is ANTLR
» Why use ANTLR
» Showcase
» Testimonials
» Getting Started
» Software License
» ANTLR WebLogs
» ANTLR Workshops
»StringTemplate
»ANTLR v2
»Feedback
»Credits
»Contact


Support StringTemplate, ANTLR Project by making a donation! Terence often pays for things like the antlr.org server, conference travel, and this site design (that alone cost US$1000). Buy him a beer and pizza remotely ;)

Search



ANTLR 3.0 Tree Grammars, Tree Parsing

RSS Feed

June 28, 2005

Ok, last night I figured out how to handle rewrites during tree parsing. Simple! Just mirror what I do for token rewrite stream but for tree rewrite stream. Queue up a series of rewrites until the end and then alter the tree, giving it back to the user. Some people have mentioned doing this on a rule-by-rule basis for tree construction (during parsing) but this will work great on a grammar basis for tree rewriting too. :) I think it was Paul Lucas and Loring and Stanchfield who were talking about a related thing on a rule basis.

Anyway, seems tree parsing really works. Got it doing the simpleC example. Here it is in it's full glory

grammar SimpleC;
options {
    output=AST;
}

tokens {
    VAR_DEF;
    ARG_DEF;
    FUNC_HDR;
    FUNC_DECL;
    FUNC_DEF;
    BLOCK;
}

program
    :   declaration+
    ;

declaration
    :   variable
    |   functionHeader ';' -> ^(FUNC_DECL functionHeader)
    |   functionHeader block -> ^(FUNC_DEF functionHeader block)
    ;

variable
    :   type declarator ';' -> ^(VAR_DEF type declarator)
    ;

declarator
    :   ID 
    ;

functionHeader
    :   type ID '(' ( formalParameter ( ',' formalParameter )* )? ')'
        -> ^(FUNC_HDR type ID formalParameter+)
    ;

formalParameter
    :   type declarator -> ^(ARG_DEF type declarator)
    ;

type
    :   "int"   
    |   "char"  
    |   "void"
    |   ID        
    ;

block
    :   lc='{'
            variable*
            stat*
        '}'
        -> ^(BLOCK[$lc,"BLOCK"] variable* stat*)
    ;

stat: forStat
    | expr ';'!
    | block
    | assignStat ';'!
    | ';'!
    ;

forStat
    :   "for" '(' start=assignStat ';' expr ';' next=assignStat ')' block
        -> ^("for" $start expr $next block)
    ;

assignStat
    :   ID EQ expr -> ^(EQ ID expr)
    ;

expr:   condExpr
    ;

condExpr
    :   aexpr ( ("=="^^ | '<'^^) aexpr )?
    ;

aexpr
    :   atom ( '+'^^ atom )*
    ;

atom
    : ID      
    | INT      
    | '(' expr ')' -> expr
    ; 

FOR : "for" ;
INT_TYPE : "int" ;
CHAR: "char";
VOID: "void";

ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

INT :	('0'..'9')+
    ;

EQ   : '=' ;
EQEQ : "==" ;
LT   : '<' ;
PLUS : '+' ;

WS  :   (   ' '
        |   '\t'
        |   '\r'
        |   '\n'
        )+
        { channel=99; }
    ;    

Note that in the tree parser, it's just the parser grammar to the right of the -> for the most part!

tree grammar SimpleCTreeParser;
options {
    tokenVocab=SimpleC;
}

program
    :   declaration+
    ;

declaration
    :   variable
    |   ^(FUNC_DECL functionHeader)
    |   ^(FUNC_DEF functionHeader block)
    ;

variable
    :   ^(VAR_DEF type declarator)
    ;

declarator
    :   ID 
    ;

functionHeader
    :   ^(FUNC_HDR type ID formalParameter+)
    ;

formalParameter
    :   ^(ARG_DEF type declarator)
    ;

type
    :   INT_TYPE
    |   CHAR
    |   VOID
    |   ID        
    ;

block
    :   ^(BLOCK variable* stat*)
    ;

stat: forStat
    | expr
    | block
    ;

forStat
    :   ^(FOR expr expr expr block)
    ;

expr:   ^(EQEQ expr expr)
    |   ^(LT expr expr)
    |   ^(PLUS expr expr)
    |   ^(EQ ID expr)
    |   atom
    ;

atom
    : ID      
    | INT      
    ; 

June 27, 2005

Ok, so it looks like I have tree parsing working nicely. :) The code generation templates are essentially identical so it was fairly straightforward--I tricked the LL(*) analyzer into seeing A B as different from ^(A B) by adding imaginary navigation nodes DOWN and UP. Works great.

Now I've got to make a decision about rewrites in a tree parser. I.e., given a tree and option output=AST, should the tree parser build a completely new tree or should it just rewrite the current one? I'm trying to think of a situation where you want to keep a tree intact during a multiple phase translation. I suppose debugging is one reason, but then if you want to keep that tree you can easily just do a dupTree on it before you pass it to the rewrite phase. Further, that dup will be faster than having the parser dup nodes.

So, imagine you have

tree parser t;
options {output=AST;}
decl : ^(DECL type declarator) ;
type : "int" | "float" ;
declarator
     : ID -> ^(ASSIGN ID INT["0"])
     ;

This parses trees like

  DECL
  / \
int  a
and rewrites them to
  DECL
  /  \
int  ASSIGN
       / \
      a   0

The DECL and type nodes should stay intact; only the declarator result has changed. The ID node will be reused but merged into a new ^(ASSIGN ID INT["0"]) subtree that has two imaginary nodes. The result becomes the new declarator tree up in decl.

Ack! The problem with rewriting a tree is that you are modifying the tree you are trying to parse. This is ok, if you do only local rewrites but what if you use an action to alter nodes in the tree that will be parsed later? You could easily create an infinite loop where a rule creates nodes that are parsed later via that same rule which creates more nodes in the future ad naseum. Ugh.

One easy way to solve this is of course with memory. The TreeNodeStream object could build up an entire list of the nodes a priori, before the parse begins. In this way, the parse will proceed exactly as you expect even if you unlink / rewrite every node in the tree. But the cost is 4 bytes ptr per node and with 500,000 nodes as the example somebody gave, that's already 2G of RAM. :( This could be mapped to the disk like virtual memory, but...

Of course, creating a new tree of 500k nodes is even more expensive.

This "alter the future" is more common than you think. What if you're generating code for a translator to C from Java and you create memory in some statement of a block. As you do that you may want to generate a free() call at the end of the block. When the parser finishes the last "real" statement, it will start to parse your free() calls! The problem is that those are C statements not Java! The parser will puke.

The choice appears:

  1. don't build a tree and you can use the efficient TreeNodeStream that doesn't buffer
  2. DUP: build a new tree from the old tree; you can use the efficient TreeNodeStream that doesn't buffer
  3. REWRITE: TreeNodeStream buffers the entire sequence of nodes to visit in the tree before parsing begins and nodes are rewritten not dup'd.

Come to think of it, there could be some really strange bugs introduced if you rewrite a subtree that is later parsed using it's original structure (but it no longer has that stronger and some of your actions will fail!). Argh!

June 23, 2005

Just released ANTLR 3.0ea3, which has the tree construction and rewrite rules stuff...now on to tree grammars real quick.

Grammars

Programmers are used to having parsers recognize structure in single-dimensional streams of tokens. These parsers often build trees that are later walked in some manner to extract information or to generate a translation.

Tree parsers do the same thing except that they recognize structure in a two-dimensional stream of tree nodes. The key to unifying parsing and tree parsing is to serialize a tree into a token stream by adding imaginary tokens DOWN and UP to encode tree structure.

From a grammar point of view, though, a parser grammar and a tree grammar look identical except for the addition of ^(root children...) tree structure syntax. The syntax mirrors the construction syntax in token parser rewrite rules.

Here is a simple example that matches a list of declaration trees such as:

    DECL
    /  \
"int"   ID

Here's the grammar:

tree grammar LangTreeParser;
options {
    tokenVocab=Lang; // required: must say what token types are
}

program : decl+ ;

decl : ^(DECL type ID) ;

type : "int"
     | "float"
     ;

So, all I need to do is go into the ANTLR antlr.g parser that builds trees and have it turn

^(root child1 ... childN)

into

root DOWN child1 ... childN UP

and the rest of ANTLR will be none-the-wiser :) The LL(*) analysis algorithm will just simply work and provide k>1 lookahead for tree parsers. ANTLR will see the following two alternatives as different:

a : A B
  | ^(A B) // transformed to A DOWN B UP
  ;

To actually parse at runtime, I'll need a CommonTreeNodeStream that serializes a tree, but it should be easy. This part is brain dead simple.

However, what do labels mean and what can actions do?

Labels, types, ...

SO, in the following grammar fragment, what type is $t? What type is $ID?

a : t=type id=ID {$t ... $ID ...}
  ;

If this were a parser grammar, $ID would be the token matched for ID and $ID.tree is the tree created from the matched token. I sort of like the idea that I can do multi-stage parsing. Take a parser that parses a token stream and generates a token stream. Feed that stream to another parser (a tree parser) that feeds off a flat tree (token stream). So, take this grammar:

grammar foo;
options {output=AST;}
a : A B -> B A ;
A : 'a' ;
B : 'b' ;

and then grab the resulting tree, passing it to another parser:

tree grammar fooTree;
options {tokenVocab=foo;}
a : B A ;

that matches the altered token stream. I wonder if this could help with parsing C++ and other nasty beasts. Hmmm...not that common.

Actually, you could reuse the same grammar! I.e., the grammar only cares that it gets a token stream. It doesn't matter the source. You could have a parser eat its own output. Cute, but useful? Well, I like consistency...who knows what it will be useful for.

The reason I mention this is it affects the type of $ID. If you want the same parser to parse tokens and tree nodes, the actions will only work if the type of $ID is the same.

I guess you might do token stream to token stream (flat AST node stream) parsing, but the grammars would be different so you could make one a tree parser. Also, $ID cannot be assumed to be Token for tree parsing as your tree nodes may not track a token object as payload--you might, for example, inline all the data as fields rather than use a payload. I guess $ID.text could still work...

$ID is type Token in a parser grammar because it is the fundamental input element. That sort of implies that a tree garmmar should have $ID be a tree node, whatever your ASTLabelType is. It would be useful to define $ID.tree perhaps to make it easy to transfer a parser grammar to a tree grammar.

Tree Parser Template

The treeParser template should be virtually identical minus token type defs and TokenStream -> TreeNodeStream.

TreeParser object

Need a TreeParser. Hmmm....can't subclass Parser, but pulling out a shared GenericParser to reuse match() etc... can be a problem. It can use just input.LA(1), which will work when I use field:

IntStream input;

but the generated parser for token stream will need input.LT(1), which is undefined for the super-interface IntStream. Crap, I don't want all "label = input.LT(1);" assignments to be wrapped in a case (slow and ugly).

I know. Make any method that needs input to take it as an IntStream parameter as LA() is the only method needed. Code gen would change to match(input, tokenType) but I don't think the it will take much time to pass another parameter.

I use input.LT(1) a lot, but I could change the code gen so Token is replaced with Object for tree parsing...yeah, I guess that will work.

TreeNodeStream

Here is the interface that the tree parser would need to parse and yet keep all code generation identical for things like template tokenRef.

/** A stream of tree nodes, accessing nodes from a tree of some kind */
public interface TreeNodeStream extends IntStream {
    /** Get tree node at current input pointer + i ahead where i=1 is next node.
     *  i<0 indicates nodes in the past.  So -1 is previous node and -2 is
     *  two nodes ago. LT(0) is undefined.  For i>=n, return null.
     *  Return null for LT(0) and any index that results in an absolute address
     *  that is negative.
     *
     *  This is analogus to the LT() method of the TokenStream, but this
     *  returns a tree node instead of a token.  Makes code gen identical
     *  for both parser and tree grammars. :)
     */
    public Object LT(int k);

    /** Get a node at an absolute index i; 0..n-1. */
    public Object get(int i);

    /** Where is this stream pulling nodes from?  This is not the name, but
     *  the object that provides node objects.
     *
     *  TODO: do we really need this?
     */
    public Object getTreeSource();
}

A CommonTreeNodeStream would keep a stack of pointers into the tree and a stack of int indexes that indicated where (which node/child) the stream leaves off at each level of the tree.

AST to AST transformations

Copy or rewrite inline? Just posted email to the list in response to post by Akhilesh Mritunjai:

"Yup, in tree parser, and I vote for default in-line. As for creating a new tree, thats what 'buildAST=true' is there for. Am I right ? Coupled together with a..."

Ok, so output=AST (in 3.0 parlance) would construct a tree from whatever the input was: tokens or trees. A tree parser with rewrite rules, however, would do a rewrite. Let's see an example:

parser grammar langParser;
options {output=AST;}
program : decl+ -> ^(DECL decl+) ;
decl : FLOAT_TYPE ID ';' -> ^(FLOAT_TYPE ID) ;

which creates trees like this:

       DECL
      /    \
"float"  "float"
   |        |
  var1     var2

(ain't that easy using the new notation?!!)

then we'd have a tree grammar that created a new tree from the old:

tree grammar langTreeParser;
options {output=AST;}
program : ^(DECL decl+) ;
decl : ^(FLOAT_TYPE ID) -> ^(FLOAT_TYPE ID FLOAT["0.0"]) ; // add "= 0.0"

(note how the tree grammar is the collection of right-hand-sides from the -> rewrites). heh, somebody ought to write this down as doc ;)

Now, to do rewrites, one would leave off the output option:

tree grammar langTreeParser;
program : ^(DECL decl+) ;
decl : ^(FLOAT_TYPE ID) -> ^(FLOAT_TYPE ID FLOAT["0.0"]) ; // add "= 0.0"

Yes...I guess that makes sense. The method generated for rule "program" would have to replace the children list of DECL with the new subtrees created by rule decl. I guess like this:

program() {
  match(DECL);
  match(DOWN);
  while ( input.LA(1)==FLOAT_TYPE ) {
    result = decl();
    // replace child i with result tree
  }
  match(UP);
}

I am not sure how to do that easily, though I of course see the need (particularly given the tree sizes you refer to below...yikes!) The TreeNodeStream that serializes the trees for two-dimensional parsing could be asked to replace the node I suppose...sort of like asking an iterator to do a delete. Hmm...that is an interesting approach. It would only allow replacement of a node under the "cursor" at input.LT(1) I guess. It would have to make sure not to try to parse the new node. You might have a single node replaced with two nodes:

tree grammar t;
a : b+ ;
b : B -> X Y ; // replace B with X Y (2 nodes)

If the input were say B B then after the first call to b the "X Y" would replace the first B, yielding X Y B, but the TreeNodeStream would have to do a "replace and jump over new nodes" so the current pointer was on the 2nd B not the Y.

Regardless, the approach would invisibly allow a completely new tree to be created or to have a tree modified inline. Wow.