Home | Download | ANTLRWorks | Wiki | About ANTLR | Feedback | Support | Bugs | v2


Latest version is 3.0.1
Download now! »

Download
» Home
» Download
» ANTLRWorks
» News
»Using ANTLR
» Documentation
» FAQ
» Articles
» Grammars
» File Sharing
» Runtime API
» Tech Support
» Bug Tracking
»About ANTLR
» What is ANTLR
» Why use ANTLR
» Showcase
» Testimonials
» Getting Started
» Software License
» ANTLR WebLogs
» ANTLR Workshops
»StringTemplate
»ANTLR v2
»Feedback
»Credits
»Contact


Support StringTemplate, ANTLR Project by making a donation! Terence often pays for things like the antlr.org server, conference travel, and this site design (that alone cost US$1000). Buy him a beer and pizza remotely ;)

Search



ASTs, Parse Trees, etc...

RSS Feed
Terence Parr

April 20, 2007

Hmm... ^(PARMS parms+)? doesn't work. There are no real elements in the (...)? So, unlike loops then, the optional subrule has to check deeply for referenced elements. Alter define.g so that it sets a currentOptionalBlock not just current block. Add all elements to it?

April 19, 2007

Ok, got a problem. For multiple IDs in the following grammar, you should get multiple copies of both type and modifier result trees.

a : modifier? type ID (',' ID)* ';' -> ^(type modifier? ID)+ ;

Should get trees like:

(int public a) (int public b) (int public c)

Currently I get

(int public a) (int b) (int c)

Once the modifier is consumed in the modifier? it is not visible the next time around the loop:

while ( stream_ID.hasNext()||stream_type.hasNext()||
        stream_modifier.hasNext() )
{
    ...
    if ( stream_modifier.hasNext() ) {
        adaptor.addChild(root_1, stream_modifier.next());
    }
    ...
}

The condition stream_modifier.hasNext() is false after first execution of the optional subrule. One way to think about this is to reset the model fire stream after the optional subrule, but this makes the while conditionthink there are an infinite number of modifier trees in the stream_modifier.

Certainly it seems like a good idea to reset the streams after a subrule so that one can repeat a subrule and get a duplicate of everything inside; e.g.,

... -> ^(type ID)+ ^(type ID)+ ;

This should create two duplicate lists of trees, but currently you will get a cardinality exception because type and ID will run out of elements in their streams after the first subrule.

It seems clear that the main issue is nested subrules. I am using a condition for the outer subrule that tests how many elements are left in each stream referenced at any depth in the subrule. Any streams that get reset in a nested subrule cause an infinite outer loop. We need to be able to reset streams for nested subrules without affecting the condition of the enclosing subrule(s).

Ok, let's think about what ^(type modifier? ID)+ means; that is, how many times does that closure loop go around? The answer is that it goes around as long as there are elements left to emit. The only problem comes in when we need to duplicate. Of course, we only duplicate when the element is a size 1, which may lead me to a solution here. Let's consider the case where all streams have multiple values:

a : ID+ INT* -> (ID INT?)+ ; // assume |ID|==|INT|

In this case, there can be no duplication. We do not want to reset stream_INT after the optional subrule because we need to keep pulling from the stream not duplicate the first element. If there is a cardinality difference, it is an error. Now, let's consider the case where all elements have cardinality 1:

a : ID INT -> (ID INT?)+ ; // assume |ID|==|INT|==1

Loops can only go around once if all elements have the same cardinality and that carnality is 1. In this case, the optional subrule should not reset stream_INT because we want to exit the outer loop. There are no more ID nor INT elements to iterate on. The loop should enter when either stream has a value.

The problem then seems to be only when we have a mismatch between multi-valued streams and single valued streams:

a : 'int'? ID+ -> ('int' INT?)+ ; // assume |ID|==|INT|

Wait. Perhaps the answer is to avoid testing the streams for nested subrules. For example, the following is weird because there are no elements that the loop knows how to test: expr is in the subrule and the outer subrule does not know when to keep looping.

a : expr+ -> ^(EXPR expr?)+ ;

Now, with the following semantically-equivalent spec, we get one that works:

a : expr+ -> ^(EXPR expr)* ;

You will get a series of EXPR-rooted trees with single expr children. The loop will test for the only non-imaginary element in the loop: expr. Turning back to our original problem:

^(type modifier? ID)+

it seems that we should test only type and ID streams:

while ( stream_ID.hasNext()||stream_type.hasNext() ) {
    ...
    if ( stream_modifier.hasNext() ) {
        adaptor.addChild(root_1, stream_modifier.next());
    }
    stream_modifier.reset(); // NOW WE CAN RESET
    ...
}
stream_ID.reset();
stream_type.reset();
// further refs to these streams will yield elements

Ha! I think that's the trick!

April 14, 2007

With ref to labels, I think that we need to create streams for single-valued labels and list labels. Good, looks like it already tracks elements in two lists; e.g., ids+=ID (',' ids+=ID)* yields (partially):

ids=(Token)input.LT(1);
match(input,ID,FOLLOW_ID_in_d36); 
list_ID.add(ids);
if (list_ids==null) list_ids=new ArrayList();
list_ids.add(ids);

Ok, so we just have to convert List to RewriteRuleTokenStream or whatever. Ack! These are accessed from actions on left side of -> too. I guess we have to create a stream at start of -> that feeds off that list.

For single-valued labels such as t=type, we need to create:

RewriteRuleSubtreeStream list_type =
    new RewriteRuleSubtreeStream(adaptor,"label t",t.tree);

and then use list_type.next() not t.tree in the rewrite section.

We have referencedTokenLabels etc..., computed by codegen.g PER block. I think we need an overall list and then I can just create the stream objects at start of -> code section

Crap: we also need to know whether the label is a token label, rule label, or list version of that. Ok, I can compute that. Ack. What about multiple -> in subrules for a single rule? How does that affect list of rewrite elements? Track per rewrite? or globally per alt?

Ack. What about multiple -> separated by predicates? Hmm...have to prepare for any of them to execute. Treat them as one rewrite in terms of labels and stream set up.

April 13, 2007

ooops. What about labels? Seems like they should go into a stream too so the code is uniform.

Wait. Labels simply point at the last Token or subtree tracked for a rule element.

a : x=A y=b -> $x $y ;

This will yield the following defs:

Token x=null;
b_return y = null;
RewriteRuleTokenStream list_A =
       new RewriteRuleTokenStream(adaptor,"token A"); 
RewriteRuleSubtreeStream list_b =
       new RewriteRuleSubtreeStream(adaptor,"rule b");

and then

// x=A
x=(Token)input.LT(1);
match(input,A,FOLLOW_A_in_a32); 
list_A.add(x);

// y=b
pushFollow(FOLLOW_b_in_a36);
y=b();
_fsp--;
list_b.add(y.getTree());

Ok. Referencing $x and $y means what? How does $x differ from A in the rewrite rule in this case? Hm...well, let's look at what we do currently. The rewrite code gen does this:

adaptor.addChild(root_0, x); // x has to be a tree now
adaptor.addChild(root_0, y.tree); // fine

This seems ok, but what about duplication?

d : t=type ID (',' ID)* -> ^($t ID)+ ;

This collects elements like this:

// t=type
pushFollow(FOLLOW_type_in_d32);
t=type();
_fsp--;
list_type.add(t.getTree());

// ID
ID1=(Token)input.LT(1);
match(input,ID,FOLLOW_ID_in_d34); 
list_ID.add(ID1);

We need to duplicate type's return tree for every iteration of the rewrite loop. No copying at moment:

// build tree ^( $t ID )
Object root_1 = (Object)adaptor.nil();
root_1 = (Object)adaptor.becomeRoot(t.tree, root_1); // t.tree isn't dup'd
adaptor.addChild(root_1, list_ID.next());
adaptor.addChild(root_0, root_1);

Instead of t.tree, it should use list_type.next() I believe. Wait, no that doesn't let me distinguish between first and following elements needed for:

prog: main=method others+=method* -> ^(MAIN $main) $others* ; 

So, $main should literally be main.tree but how do we duplicate it if it's in a loop like $t above?

April 12, 2007

Ok, I think I have a new understanding or perspective on rewrite rules. We should imagine the various lists of tokens as separate input streams that the rewrite rule can draw from. We do not need to check the cardinality upfront--we can simply keep pulling from the list until we run out, at which point we can throw something analogous to a mismatched token exception. Avoiding the complex cardinality checks is a big win.

This mechanism also makes it easy to handle node duplication. If you're stream size is 1 and you are asked to produce more than one node, simply duplicate the first element. We will need to wrap each token reference in a method call or use a special List.That way, I can also avoid having to use indexes. Each list will act like a token stream that contains a cursor. next() can pull from the stream instead of get(i). Let's look at a few examples:

A -> A
root_0 = (Object)adaptor.nil();
adaptor.addChild(root_0, list_A.next());
A -> A A
root_0 = (Object)adaptor.nil();
adaptor.addChild(root_0, list_A.next());
adaptor.addChild(root_0, list_A.next()); // next() duplicates 1st A

That looks great. Hmm...how do we know whether to enter a subrule? If it is a single element or all simple elements (no nested subrules), it is easy:

A B -> (A B)?
root_0 = (Object)adaptor.nil();
// if any element has data, enter
if ( list_A.hasNext() || list_B.hasNext() ) {
    adaptor.addChild(root_0, list_A.next());
    adaptor.addChild(root_0, list_B.next());
}

If for some reason there are was not an A or B, next() would throw an exception. That seems ok.

Now, what do we do about subrules with random elements such as other subrules?

A B+ -> (A+ B+)?

How do we know whether to enter the optional subrule? To answer that question, we have to know whether the subrule will generate a tree. Knowing that requires recursively examining each element. Hmm.. I guess that is it: we need to simply recurs down to find all token and rule references mentioned within a subrule. If any of those have elements last, we enter the subrule. If there is a mismatch between cardinality, we will find it after we enter.

Here's a weird extreme case:

(A B)+ C D* -> ( ( (A B)+ C? D* )? )?

The entry expression for the outermost subrule is:

if (A or B or C or D have more elements) ...

The first nested subrule would test the same list. Basically, the rule is probably:

if there is a reference to an element that still has a token left to add to the tree, we should enter any surrounding subrule.

The (A B)+ subrule is entered if either A or B has a token. The remaining subrules are obvious.

How do we deal with checking cardinality for 0 versus 1? For example, (A B)? should give an error if:

  1. there is an A but not a B or vice versa (|A|=0 but |B|>0 or vice versa).
  2. |A|>1 or |B|>1

The subrule is entered if either A or B have tokens as a general rule regardless of the type of subrule--condition 1 could happen. In that case, one of the next two statements will fail:

if ( A or B has elements ) {
  adaptor.addChild(root_0, list_A.next());
  adaptor.addChild(root_0, list_B.next());
}

I guess that is what we want...the subrule would not have been entered if neither A nor B had tokens.

Now, what about loops like (A B)+? The loop will enter and continue as long as either A or B have elements. To be valid, |A|=|B| must hold or |A|>1 and |B|=1 or vice versa. In either case, at least one of the lists will have the token so the loop will enter. Let's consider the case where |A|=0 and |B|=1 to see if we catch it:

// (...)+ looks must have at least one iteration
if ( !(A or B has elements) ) throw missing element exception
// now can loop assuming there is something to iterate over
while ( A or B has elements ) {
  adaptor.addChild(root_0, list_A.next()); // |A|=0 so this fails
  adaptor.addChild(root_0, list_B.next());
}

Ok, that works and so would |A|=1 and |B|=0. Let's try the case where both are n=|A|=|B|. That would work also because each iteration would pull one value from the lists. What about the case where |A|=1 and |B|>1, say, |B|=2? We may have a problem knowing when to "smear" the As (duplicate for each B).

while ( A or B has elements ) {
  adaptor.addChild(root_0, list_A.next()); // |A|=1
  adaptor.addChild(root_0, list_B.next()); // |B|=2, pull one each iteration
}

What about repeated references to the same element? A -> A A should copy, clearly. What about A A -> A A? Seems like that should simply have two output creation sites that draw from the same list:

adaptor.addChild(root_0, list_A.next());
adaptor.addChild(root_0, list_A.next());

So what about A+ -> A A*? This would go against what the book says I think, but makes sense in this new perspective:

adaptor.addChild(root_0, list_A.next());
while ( list_A.hasNext() ) {
    adaptor.addChild(root_0, list_A.next());
}

In the book, I say that if you want to treat the first of several elements differently, you need to use labels; But, I'm not sure that is true. For example, I said you had to do this:

prog: main=method others+=method* -> ^(MAIN $main) $others* ; 

but, why not do this:

prog: method+ -> ^(MAIN method) method* ; 

This way we can look at the term method as simply a source of subtrees and we use them up as the references go along. This does not give you very much flexibility though in terms of the order. For example the following does not work in that you never get a proper MAIN subtree (there will not be a method child):

prog: method+ -> method* ^(MAIN method) ;

I think one of the things that I'm trying to do is remove the ordering problem. I guess this is a rare case and then you could still use the label solution. Most of the time, people will only have one reference. If there is more than one reference, let's imagine that it consumes from left to right lexically.

Ok, more copying concerns. How do we know when to copy when we are not in the loop? What code is generated for A -> A A? Perhaps this:

adaptor.addChild(root_0, list_A.next());
adaptor.addChild(root_0, list_A.next()); // copies since |A|=1

What about A A -> A A A, though?

adaptor.addChild(root_0, list_A.next()); // index 0, no problem
adaptor.addChild(root_0, list_A.next()); // index 1, no problem
adaptor.addChild(root_0, list_A.next()); // |A|=2, cause error; can't dup!

It seems that we should cause an error rather than duplicate. We do not want to have a rule where we smear the last element out because otherwise that would screw up loops that had elements with mismatched cardinality. Ok, seems like the rule should be:

next() duplicates if |list|=1 and already emitted list.get(0)

Let's check some duplication related to loops:

a : type ID (',' ID)* -> type ID (type ID)*

That should generate something like the following:

adaptor.addChild(root_0, list_type.next());
adaptor.addChild(root_0, list_ID.next());
while ( type or ID has more elements ) { // type finished at this point
    adaptor.addChild(root_0, list_type.next()); // |type|=1, duplicate
    adaptor.addChild(root_0, list_ID.next());   // keep pulling from ID list
}

What about the following? How long should the loop go around

a : b -> (b b)+ ;

The loop will enter the first time because 'b' has one element but should only go around once once we've consumed it.

while ( b has more elements ) {
    adaptor.addChild(root_0, list_b.next()); // emit first and only b
    adaptor.addChild(root_0, list_b.next()); // |b|=1, duplicate
} // only go around once because be has no more elements

Ok, this all seems very consistent to me. Hooray! Let's try to formalize this:

  • token and rule references translated to
    adaptor.addChild(root_0, list_x.next());
    
    where next() returns the next element or duplicate if |x|=1.
  • (...)? translated to
    if ( any list has more tokens/trees ) {
        ... build trees for the block elements ...
    }
    
    where "any list" means the lists for any element referenced within the block or in any nested block.
  • (...)* translated to
    while ( any list has more tokens/trees ) {
        ... build trees for the block elements ...
    }
    
  • (...)+ translated to
    if ( !(any list has more tokens/trees) ) throw missing element exception
    while ( any list has more tokens/trees ) {
        ... build trees for the block elements ...
    }
    

The only pain is to recursively find the list of referenced elements.

April 11, 2007

Ok, the AST rewrite cardinality checks and node duplication has some bugs. I need to rethink to be more consistent and correct. One issue is that the current code generation assumes that all elements within a loop at the same cardinality, but this is not always true:

decl : 'int' ID (',' ID)* -> ^('int' ID)+ ;

In this case, we need to duplicate 'int' for every ID found in the input string. Input

int a,b,c;

should create the following tree:

(int a) (int b) (int c)

Further, its cardinality is 1, which is not equal to the cardinality of the ID list in this case. The same is true for user rule:

decl : type ID (',' ID)* -> ^(type ID)+ ;
type : 'int' ;

We need to duplicate the tree return from type each iteration through the tree construction loop in the rewrite. Previously I had only done the duplication when the rule was mentioned twice as in:

a : atom -> ^(atom atom) ; // NOT CYCLE! (dup atom)

My goal in this entry is to brainstorm about how each EBNF rewrite construct should be translated to code. Let's take some examples to see if we can extract what to do, given what ANTLR does now. Let's starting with the optional EBNF construct:

(...)? Zero-or-one blocks

rewrite Code
A -> A?
root_0 = (Object)adaptor.nil();
// compute length of the only element
int n_1 = list_A == null ? 0 : list_A.size();
// ensure that length is zero or one
if ( n_1 > 1 ) throw new RuntimeException("A list has > 1 elements");
// only enter the optional loop if there is something to generate
if ( n_1==1 ) {
    int i_1 = 0;
    adaptor.addChild(root_0, (Token)list_A.get(i_1));
}
A B -> (A B)? Now, we need to verify that both elements are the same length and that that length is not greater than 1.
root_0 = (Object)adaptor.nil();
// compute length of any element
int n_1 = list_A == null ? 0 : list_A.size();
// ensure that the other list is the same size
if ( list_B.size()!=n_1 )
    throw new RuntimeException("rewrite element B list differs "+
                               "in size from other elements");
if ( n_1 > 1 ) throw new RuntimeException("A list has > 1 elements");
// only enter the optional loop if there is something to generate
if ( n_1==1 ) {
    int i_1 = 0;
    adaptor.addChild(root_0, (Token)list_A.get(i_1));
    adaptor.addChild(root_0, (Token)list_B.get(i_1));
}       

(...)* Zero-or-more blocks

rewrite Code
A -> A*
int n_1 = list_A == null ? 0 : list_A.size();
for (int i_1=0; i_1<n_1; i_1++) {
    adaptor.addChild(root_0, (Token)list_A.get(i_1));
}
(A B)* -> (A B)*
int n_1 = list_A == null ? 0 : list_A.size();
if ( list_B.size()!=n_1 )
    throw new RuntimeException("rewrite element B list differs "+
                               "in size from other elements");
for (int i_1=0; i_1<n_1; i_1++) {
    adaptor.addChild(root_0, (Token)list_A.get(i_1));
    adaptor.addChild(root_0, (Token)list_B.get(i_1));
}       
A B* -> A B*

Ok, now I'm remembering...

If everything inside the loop has cardinality 1, then loop goes around n=1 times. If at least one element has cardinality > 1, then n = cardinality of that element. All elements with cardinality > 1 must have exactly the same cardinality. What about cardinalities 1 and 0, as in:

r : A B* -> (A B)* ;

That will fail if |B|=0. What would the rewrite produce? So, let's add a rule: If all elements are <= 1, then all must same cardinality.

Ok, so cardinalities can be:

  • all 0
  • all 1
  • some 1 and the rest n
  • all n

Ok, I guess we can say: all cardinalities must be same, n, 'cept that for blocks with all non-zero cardinalities can be n or 1.

Hmm...what about non-elements like nested subrules?

Nested subrules
rewrite Code
A? B C -> (A? B C)+
int n_1 = list_B == null ? 0 : list_B.size();
if ( list_C.size()!=n_1 )
    throw new RuntimeException("rewrite element C list differs "+
                               "in size from other elements");
if ( n_1==0 ) throw new RuntimeException("Must have more than one "+
                                         "element for (...)+ loops");
for (int i_1=0; i_1<n_1; i_1++) {
    // u.g:11:16: ( A )?
    int n_2 = list_A == null ? 0 : list_A.size();
    if ( n_2 > 1 ) throw new RuntimeException("A list has > 1 elements");
    if ( n_2==1 ) {
        int i_2 = 0;
        adaptor.addChild(root_0, (Token)list_A.get(i_2));
    }
    adaptor.addChild(root_0, (Token)list_B.get(i_1));
    adaptor.addChild(root_0, (Token)list_C.get(i_1));
}
A? B? -> (A? B?)+ What about a loop with no elements, just subrules?
// current code gen has MISSING "int n_1" DECLARATION!
if ( n_1==0 )
    throw new RuntimeException("Must have more than one "+
                               "element for (...)+ loops");
for (int i_1=0; i_1<n_1; i_1++) {
    // u.g:13:15: ( A )?
    int n_2 = list_A == null ? 0 : list_A.size();
    if ( n_2 > 1 ) throw new RuntimeException("A list has > 1 elements");
    if ( n_2==1 ) {
        int i_2 = 0;
        adaptor.addChild(root_0, (Token)list_A.get(i_2));
    }
    // u.g:13:18: ( B )?
    int n_2 = list_B == null ? 0 : list_B.size();
    if ( n_2 > 1 ) throw new RuntimeException("B list has > 1 elements");
    if ( n_2==1 ) {
        int i_2 = 0;
        adaptor.addChild(root_0, (Token)list_B.get(i_2));
    }
}

June 20, 2006

Currently we cannot do tree->tree transforms using tree grammars. If you try, you can see that the basic mechanism borrowed straight from parser -> AST almost works. My concern is how to deal with sharing tree nodes. Clearly if you only want to go through and replace x*0 with 0 everywhere in expressions you should be able to reuse the rest of the tree; i.e., without dup'ing it.

I think the way to deal with that is that if you don't create a tree, return the incoming tree pointer as the result tree.

Bernhard Damberger: Two kinds of lines of code are causing problems:

adaptor.setTokenBoundaries(retval.tree,retval.start,retval.stop);

In this case retval.start and retval.stop are Objects, not Tokens.

Yes, that would need to be turned off when building trees from a tree; those values are set and should not be touched.

And

string_literal4=(Object)input.LT(1);
match(input,118,FOLLOW_118_in_statement63);
string_literal4_tree =(Object)adaptor.create(string_literal4);

on the third line because string_literal4 is expected to be a token, but its a Object.

We could add that method create(object) to the adaptor no problem. The issue is that we'd be dup'ing nodes like crazy.

I think tree grammars should by default just return the input tree. If you have -> rewrites then a new subtree for a rule is created. ^ and ^^ would not be allowed and I'd like to disallow ! also to be consistent. ! could be useful but mostly likely you would want to remvoe a node upon some condition, hence, just saying ID! is not that useful. If you unconditionally knew you didn't want ID in the tree, you could have used ID! when creating the tree. Actually, you could also reuse old subtrees too:

expr
    :   ^(ADD x=expr y=expr)    // return either subtree or whole thing
            -> {isZero($x)}? $y
            -> {isZero($y)}? $x
            -> $expr
    |   ^(MULT x=expr y=expr)
            -> {isZero($x)}? INT["0"]
            -> {isZero($y)}? INT["0"]
            -> $expr
    |   ID                      // same ID node returned
    ;

So every rule would by default terminate with a hidden "-> $rule"

Perhaps you'd also be interested in inserting nodes anywhere in the tree when you see, for example, an implicit var def (x=3). You'd want to insert a def for x somewhere above. You could say

engine.insertAfter($block.getFirstChild(), ^(DECL, TYPE["int"], $ID))

or some such. Hmm...needs more thinking...

April 26, 2006

updated April 27, 2006

Musings about adding AST debugging to debug event listener ANTLRWorks.

Functionality

Show tree being built. Sequence of events coming from running parser. Stepping by location will show nicely.

Tree view for smaller trees and "directory" view for big trees.

Do we show all subrule roots or just rule's root? How would we display?

a : A (','^ A)* ;

As subrules create trees, stack the trees and then hook them together as subrule finishes (animate?). Events can be sent like r0=..., r1=..., r2=... where the n indicates depth like stack and r0 is rule's root ptr. Every rn=... signal could mean start a new subtree below current (pushing up existing to make room or something). A close rn signal is sent to say done with that tree.

Do we allow users to see trees returned by rule refs?

a : x=expression SEMI ; // can I see x's tree?
Sure. Send label events for label=ID, but how to display in AW?

Nodes have start/stop token index set that is the range of tokens for that node and below from which this subtree was created. Useful to highlight input sequence associated with start/stop index for a highlighted node.

Setting up adaptor. Current tree construction:

protected TreeAdaptor adaptor = new CommonTreeAdaptor();

public void setTreeAdaptor(TreeAdaptor adaptor) {
    this.adaptor = adaptor;
}

We should change to be DebugTreeAdaptor but I won't auto-wrap the adaptor in setTreeAdaptor...they might want to set their own debug adaptor.

Events

create nil (even nil nodes have a unique ID...they are not "null" per se)

r0 = nil // signal that tree construction for r0 commences

create node from token index t; assigns unique node ID (stored in tree)

becomeRoot node ID, root ID

r0 = node ID (as returned from becomeRoot)

addChild root ID, n

addChild r0.ID, r1.ID

close r1 // do we need this for AW???

close r0 // if so, do we need this or is it implied by rulePostProcessing

rulePostProcessing(tree) // converts ^(nil x) to x only so far

setTokenBoundaries r0.ID, start, stop // r0 might be new node

label = node ID // rule return labels

becomeRoot token index, root ID // rewrites

addChild root ID, token index // rewrites

Ok, added code; might be like this:

/** A nil was created (even nil nodes have a unique ID...
 *  they are not "null" per se)
 */
public void nilNode(int ID);

/** root_0 = ...;  Often (but not always) indicates start of new subtree.
 *  Does enterSubrule event help here?
 */
public void setSubRuleRoot(String name, int ID);

public void createNode(int ID, String text, int type);

public void createNode(int ID, int tokenIndex);

public void becomeRoot(int newRootID, int oldRootID);

public void addChild(int rootID, int childID);

/** converts ^(nil x) to x */
public void trimNilRoot(int ID);

public void setTokenBoundaries(int ID, int tokenStartIndex, int tokenStopIndex);

Issues

How to identify nodes so we can say "add node to a prior node"? Even becomeRoot is an issue. Ok, number the nodes as they are created? Will have to wrap the tree and delegate to get the unique node ID...create a DebugTreeAdaptor that delegates to a TreeAdaptor? Yep. Ah, nope...that will mess of tree construction. Use a Map<Tree,Integer> or can we get away with a node's hashCode? No, two identical nodes could be in tree and hashCode would be same if they implement that method. Damn...no way to get address. I wonder if we can check to see if they implement hashCode and if so use the super's hashCode. Nope...node.super.hashCode() doesn't parse. Let's assume hashCode for now. Equals will have to be implemented, but hashCode must remain unimplemented against Java doc. No biggie as it's unlikely that people will add AST nodes to HashMaps.

When creating nodes from tokens or doing addChild/becomeRoot, can I refer to tokens by their index since AW has the list?

Do we step through tree construction in rewrite rules? Can emit location events I suppose. Useful? Say no at beginning but still need to show tree at end of rule. Final location will show this I think.

Throw out tree on error? Do we return partial tree now? Are tree actions turned off during recovery mode? Seems like they should be.

What kind of trees do we work with? TreeAdaptor knows how to get the text and token indexes (if any). Create event sends ID, text, token type

Test Grammar

This grammar sort of shows what events are triggered (in generated code).

a 	:	'a'^ 'b' 'c'! ('d'^^ 'e')? ;

create nil (even nil nodes have a unique ID...they are not "null" per se)
r0 = nil // signal that tree construction for r0 commences
create node from token index t; assigns unique node ID (stored in tree)
becomeRoot node ID, r0.ID
r0 = node ID (as returned from becomeRoot)
create node n, token index t
addChild r0.ID, n
create nil
r1 = nil // start new subtree
create n, token index t
becomeRoot node n, r0.ID
r0 = n from becomeRoot
create n, t
addChild r1.ID, n
addChild r0.ID, r1.ID
close r1 // do we need this for AW???
close r0 // if so, do we need this or is it implied by rulePostProcessing
rulePostProcessing
setTokenBoundaries r0.ID, start, stop
b 	:	'a' 'b' 'a' -> ^('b' 'a'+) ;

// during rewrite:
r0 = nil
r1 = nil
create btokens[0]
becomeRoot n, r1.ID
addChild r1.ID, atokens[0]
addChild r1.ID, atokens[1]
addChild r0.ID, r1.ID
close r1
close r0
rulePostProcessing
setTokenBoundaries r0.ID, start, stop
c 	:	x=a ;

r0 = nil
x = ID of tree returned from a
addChild r0, x
close r0
rulePostProcessing
setTokenBoundaries r0, start, stop
d 	:	('a'^ ('b' 'c')) ('d' 'e') ;

nil
r0 = nil
nil
r1 = nil
create
becomeRoot n, r1
r1 = result
nil
r2 = nil
create
addChild r2, n
create
addChild r2, n
addChild r1, r2
close r2
addChild r0, r1
close r1
nil
r1 = nil
create
addChild r1, n
create
addChild r1, n
addChild r0, r1
close r1
close r0
rulePostProcessing
setTokenBoundaries r0, start, stop
e	:	'a' {;} | 'b' ;

// Both are same
r0 = nil
create n
addChild r0, n
close r0
rulePostProcessing
setTokenBoundaries r0, start, stop

Ok, some real sample output. From, this grammar

grammar T;
options {
        output = AST;
}

a       :       'a'^ 'b' 'c'! ('d'^^ 'e')? ;

and this main:

import java.io.*;
import org.antlr.runtime.*;
import org.antlr.runtime.debug.*;
import org.antlr.runtime.tree.*;

public class Test {

    public static class D extends BlankDebugEventListener {
        public void enterRule(String ruleName) { System.out.println("enterRule "+ruleName); }
        public void exitRule(String ruleName) { System.out.println("exitRule "+ruleName); }
        public void enterSubRule(int decisionNumber) { System.out.println("enterSubRule"); }
        public void exitSubRule(int decisionNumber) { System.out.println("exitSubRule"); }

        // AST stuff
        public void nilNode(int ID)
	 {System.out.println("nilNode "+ID);}
        public void setSubTreeRoot(String name, int ID)
	 {System.out.println(name+"="+ID);}
        public void createNode(int ID, String text, int type)
	 {System.out.println("create "+ID+": "+text+", "+type);}
        public void createNode(int ID, int tokenIndex)
	 {System.out.println("create "+ID+": "+tokenIndex);}
        public void becomeRoot(int newRootID, int oldRootID)
	 {System.out.println("becomeRoot "+newRootID+", "+oldRootID);}
        public void addChild(int rootID, int childID)
	 {System.out.println("addChild "+rootID+", "+childID);}
        public void trimNilRoot(int ID)
	 {System.out.println("trimNilRoot "+ID);}
        public void setTokenBoundaries(int ID, int tokenStartIndex, int tokenStopIndex)
	    {System.out.println("setTokenBoundaries "+ID+",
	            "+tokenStartIndex+", "+tokenStopIndex);}
    }

    public static void main(String args[]) throws Exception {
        TLexer lex = new TLexer(new ANTLRFileStream("/Users/parrt/tmp/ast/input.txt"));
        CommonTokenStream tokens = new CommonTokenStream(lex);
        T g = new T(tokens, new D());
        T.a_return r = null;
        try {
            r = g.a();
        } catch (RecognitionException e) {
            e.printStackTrace();
        }
        System.out.println(((Tree)r.tree).toStringTree());
    }
}

I get this output for "abcde":

enterRule a
nilNode 15699187
create 970110: 0
becomeRoot 970110, 15699187
create 13967556: 1
addChild 970110, 13967556
enterSubRule
nilNode 5913963
create 7752673: 3
becomeRoot 7752673, 970110
create 7327927: 4
addChild 5913963, 7327927
addChild 7752673, 5913963
exitSubRule
trimNilRoot 7752673
setTokenBoundaries 7752673, 0, 4
exitRule a
(d (a b) e)

June 22, 2005

Rewrite rules

The new rewrite stuff is a grammar to grammar mapping where you specify how to match something and then how to generate the resulting tree. Note that you can generate a token stream by simply generating a flat list like:

a : A B -> B A ; // reverse the input

The goal is to avoid arbitrary actions and the key insight to get pure rewrites was that the right side of the -> should be a tree grammar complete with (...)+ loops etc... Monty, Loring, and I discussed the rewrite rules extensively at the Oregon cabal 2 years ago but we couldn't figure out what to do with repeated elements like:

list : ID (',' ID)* ;

How can you generate just the list of IDs, for example. Easy. Just list the tree grammar fragment that would match what you would like to generate and ANTLR will figure out how to do the transformation. The elements on the left are queued up and then used as input streams to the "unparser" grammar on the right. Here is the rewrite to get the list of IDs:

list : ID (',' ID)* -> ID+ ;

Nice, eh? What if I want an imaginary token node, VAR, above the list?

list : ID (',' ID)* -> ^(VAR ID+) ;

What if I want a VAR for each ID? Move the VAR inside the plus loop:

list : ID (',' ID)* -> ^(VAR ID)+ ;

I think that is pretty slick. Note that to get your tree parser grammar, ANTLR should be able to copy these rules for you into a tree grammar similar to what Loring does, although his 2.8e tree grammar creator is more sophisticated than what I have in mind.

I have not done the tree parsers yet, but I think it will be really easy. The details may kill me though. ;) I should push an ANTLR-3.0ea3 release shortly.

Notes

My brief comments in the README.txt file:

1. Automatic tree construction operators are in: ! ^ ^^

2. Tree construction rewrite rules are in -> pred1? rewrite1 -> pred2? rewrite2 ... -> rewriteN

The rewrite rules may be elements like ID, expr, $label, {node expression} and trees ^( <root> <children> ). You have have (...)?, (...)*, (...)+ subrules as well.

You may have rewrites in subrules not just at outer level of rule, but any -> rewrite forces auto AST construction off for that alternative of that rule.

To avoid cycles, copy semantics are used:

r : INT -> INT INT ;

means make two new nodes from the same INT token.

Repeated references to a rule element implies a copy for at least one tree:

a : atom -> ^(atom atom) ; // NOT CYCLE! (dup atom tree)

3. $ruleLabel.tree refers to tree created by matching the labeled element.

Examples

Here are some example rewrites. I have about 40 functional tests for rewrites and about the same for automatic tree construction.

a : ID INT -> ; // don't generate anything

a : b b -> b+;  // generate a list of two b's

a : ID INT -> ^(ID INT) ; 

a : ID INT -> {false}? ^(ID INT)
           -> {true}? ^(INT ID)
           -> ID
  ;

a : op INT -> ^(op INT);
op : '+'|'-' ;

a : "var" (ID ':' type ';')+ -> ^("var" ^(':' ID type)+) ;

a : ID (',' ID)*-> ^(VAR ID)+ ; // create imaginary VAR node above each ID

a : lc='{' ID+ '}' -> ^(BLOCK[$lc] ID+) ; // derive BLOCK imaginary token node from $lc

a : first=ID (',' others+=ID)* -> $first $others+ ; // use of += labels

June 8-9, 2005

Ok, I've cut a development branch from ANTLR 3.0ea1 and am starting to work on the tree component. There are four main problems:

  1. tree and tree node structure
  2. tree construction
  3. tree rewrite rules
  4. tree parsing

Tree structure and construction

As I've mentioned previously, I'm moving away from using child-sibling trees for the common tree implementation using a simple list-of-children approach. To avoid locking everyone into the same implementation, I'm going to use an approach similar to the TreeModel thingie in Swing's tree viewing classes. Stanchfield got me thinking about this; let's see how it works out...

Let's try out an example:

a : ID '='^ INT ';'! ;

This should generate something like:

a() {
	Object a_root = adaptor.nil();

	Token ID_1 = input.LT(1);
	match(ID);
	// ick, typecasts...i'll ignore typecasts in rest of code
	UserSpecifiedTreeType ID_1_tree = (UserSpecifiedTreeType)adaptor.create(ID_1);
	adaptor.addChild(a_root, ID_1_tree);

	EQUALS_1 = input.LT(1);
	match(EQUALS);
	UserSpecifiedTreeType EQUALS_1_tree = adaptor.create(EQUALS_1);
	a_root = adaptor.becomeRoot(EQUALS_1_tree, a_root);

	INT_1 = input.LT(1);
	match(INT);
	UserSpecifiedTreeType INT_1_tree = adaptor.create(INT_1);
	adaptor.addChild(a_root, INT_1_tree);

	match(SEMI);
}

Ok, that took me about 6 rewrites to get something that also works for more complicated (double root operation) stuff like:

b : A PLUS^ B PLUS^ C ;

which should generate something like:

b() {
	Object b_root = adaptor.nil();

	A_1 = input.LT(1);
	match(A);
	UserSpecifiedTreeType A_1_tree = adaptor.create(A_1);
	adaptor.addChild(b_root, A_1_tree);

	PLUS_1 = input.LT(1);
	match(PLUS);
	UserSpecifiedTreeType PLUS_1_tree = adaptor.create(PLUS_1);
	b_root = adaptor.becomeRoot(PLUS_1_tree, b_root);

	B_1 = input.LT(1);
	match(B);
	UserSpecifiedTreeType B_1_tree = adaptor.create(B_1);
	adaptor.addChild(b_root, B_1_tree);

	PLUS_2 = input.LT(1);
	match(PLUS);
	UserSpecifiedTreeType PLUS_2_tree = adaptor.create(PLUS_2);
	// if b_root nil root, just move children else add b_root as child
	b_root = adaptor.becomeRoot(PLUS_2_tree, b_root);

	C_1 = input.LT(1);
	match(C);
	UserSpecifiedTreeType C_1_tree = adaptor.create(C_1);
	adaptor.addChild(b_root, C_1_tree);
}

Note that the tree construction occurs after the token has been matched so that errors occur before a bogus node can be added to the tree. Instead, one can add an error node in the tree to mark the problem spot.

Interestingly, the notion of a TreeAdaptor, means that I don't actually have to know any of the methods of a Tree. I can literally use Object! The only reason to know the type is so that actions can do $label.tree.myOperation() without having to cast $label.tree. I should still define a basic Tree interface so we can get the text and payload out etc... Hmm...shouldn't the adaptor do this too? Normally, I'd do

t.getPayload();
t.getText();
...

but we could also use:

adaptor.getPayload(t);
adaptor.getText(t);

This is a very non-OO thing to do (more like C), but it does provide some breathtaking flexibility, though I'm not sure we need it.

Naturally as the programmer, you will know what kind of tree you have and can directly say t.getText() etc...

Heh, do I really need the "payload"? That's an implementation detail and as long as an adaptor can answer questions about text, line, column, etc... who cares how they store stuff? The only thing I want to make sure is that you can't create cycles very easily when constructing trees. I assumed everything was going to be a payload object, but now, the adaptor will say how to create a node from a Token or other object. My TreeAdaptor looks like this so far:

public interface TreeAdaptor {
	/** Create a tree node from some payload object (normally a Token) */
	public Object create(Object payload);

	/** Add a child to the tree t */
	public void addChild(Object t, Object child);

	/** If oldRoot is a nil root, just copy or move the children to newRoot.
	 *  If not a nil root, make oldRoot a child of newRoot.
	 */
	public Object becomeRoot(Object newRoot, Object oldRoot);
	...
}

Tree construction operators and subrules

How to translate?

// build ^(VAR ^(COLON ID type))
decl : VAR^ (ID COLON^ type SEMI!)? ;

Try this:

decl() {
	Object decl_root = adaptor.nil();

	VAR_1 = input.LT(1);
	match(VAR);
	UserSpecifiedTreeType VAR_1_tree = adaptor.create(VAR_1);
	adaptor.addChild(decl_root, VAR_1_tree);

	Object subrule_1_root = null;
	if ( input.LA(1)==ID ) {
		subrule_1_root = adaptor.nil(); // track subrule's tree

		ID_1 = input.LT(1);
		match(ID);
		UserSpecifiedTreeType ID_1_tree = adaptor.create(ID_1);
		adaptor.addChild(subrule_1_root, ID_1_tree);

		COLON_1 = input.LT(1);
		match(COLON);
		UserSpecifiedTreeType COLON_1_tree = adaptor.create(COLON_1);
		subrule_1_root = adaptor.becomeRoot(COLON_1_tree, subrule_1_root);

		// add any children from subrule as child of rule's tree
		adaptor.addChild(decl_root, subrule_1_root);
		// this decl_root would be replaced with another subrule's
		// root ptr if the subrule were nested!
	}
}

That seems easy. How about:

// should build ^(PLUS ^(PLUS 3 4) 5) for "3+4+5"
e : INT (PLUS^^ INT)* ;

Should build taller and taller tree as more PLUS are seen:

e() {
	Object e_root = adaptor.nil();

	INT_1 = input.LT(1);
	match(INT);
	UserSpecifiedTreeType INT_1_tree = adaptor.create(INT_1);
	adaptor.addChild(e_root, INT_1_tree);

	Object subrule_1_root = null;
	while ( input.LA(1)==PLUS ) {
		subrule_1_root = adaptor.nil(); // track subrule's tree

		PLUS_1 = input.LT(1);
		match(PLUS);
		UserSpecifiedTreeType PLUS_1_tree = adaptor.create(PLUS_1);
		e_root = adaptor.becomeRoot(PLUS_1_tree, e_root);

		// add the next INT to the subrule's list, but later add
		// to e_root at end of subrule execution
		INT_2 = input.LT(1);
		match(INT);
		UserSpecifiedTreeType INT_2_tree = adaptor.create(INT_2);
		adaptor.addChild(subrule_1_root, INT_2_tree);

		adaptor.addChild(e_root, subrule_1_root);
	}
}

Here you must imagine that anything matched within a subrule is added as a child to the rule's current root. Since I have PLUS^^ in there, the current rule's root is set to that node so the second INT is added as a child of it. If one used postfix notation, it would still work:

// should build ^(PLUS ^(PLUS 3 4) 5) for "3 4 + 5 +"
e : INT (INT PLUS^^)* ;

Reversing the order shouldn't affect anything:

e() {
	Object e_root = adaptor.nil();

	INT_1 = input.LT(1);
	match(INT);
	UserSpecifiedTreeType INT_1_tree = adaptor.create(INT_1);
	adaptor.addChild(e_root, INT_1_tree);

	Object subrule_1_root = null;
	while ( input.LA(1)==PLUS ) {
		subrule_1_root = adaptor.nil(); // track subrule's tree

		INT_2 = input.LT(1);
		match(INT);
		UserSpecifiedTreeType INT_2_tree = adaptor.create(INT_2);
		adaptor.addChild(subrule_1_root, INT_2_tree);

		PLUS_1 = input.LT(1);
		match(PLUS);
		UserSpecifiedTreeType PLUS_1_tree = adaptor.create(PLUS_1);
		e_root = adaptor.becomeRoot(PLUS_1_tree, e_root);

		adaptor.addChild(e_root, subrule_1_root);
	}
}

Ok, something harder:

// build ^(C ^(A X) B D)
// whereas if it were "C^" instead you'd get ^(A X ^(C B) D)
a : A^ X (B C^^)? D ;

The ^^ must set the outer rule tree:

a() {
	Object a_root = adaptor.nil();

	A_1 = input.LT(1);
	match(A);
	UserSpecifiedTreeType A_1_tree = adaptor.create(A_1);
	a_root = adaptor.becomeRoot(A_1_tree, a_root);

	X_1 = input.LT(1);
	match(X);
	UserSpecifiedTreeType X_1_tree = adaptor.create(X_1);
	adaptor.addChild(a_root, X_1_tree);

	Object subrule_1_root = null;
	if ( input.LA(1)==B ) {
		subrule_1_root = adaptor.nil();

		B_1 = input.LT(1);
		match(B);
		UserSpecifiedTreeType B_1_tree = adaptor.create(B_1);
		adaptor.addChild(subrule_1_root, B_1_tree);

		C_1 = input.LT(1);
		match(C);
		UserSpecifiedTreeType C_1_tree = adaptor.create(C_1);
		a_root = adaptor.becomeRoot(C_1_tree, a_root); // become rule root

		adaptor.addChild(a_root, subrule_1_root);
	}
}

After each execution of a subrule or subrule iteration, the tree is added as a child to the current tree. If the subrule's tree is flat, all the nodes created in the subrule are added as children of the current rule's root node.

Trees from rule refs

How to handle tree results from rules?

decl returns [int foo] 
    : type^ ID ';'! {$foo=3;}
    ;
type: "int"
    ;

Each rule has an implicit tree return value (which can be accessed via $ruleRef.tree in actions).

protected static class decl_return {
    Object tree;
    int foo;
};

decl_return decl() {
	decl_return retval = new decl_return();
	Object decl_root = adaptor.nil();

	type_1 = type(); // invoke rule type
	// exception if type_1 not simple node
	decl_root = adaptor.becomeRoot(type_1, decl_root); 

	ID_1 = input.LT(1);
	match(ID);
	UserSpecifiedTreeType ID_1_tree = adaptor.create(ID_1);
	adaptor.addChild(decl_root, ID_1_tree);

	match(SEMI);

	retval.foo = 3; // user action to set return value

	retval.tree = decl_root;

	return retval;
}

Object type() { // return a tree (no other args)
	Object a_root = adaptor.nil();

	INTTYPE_1 = input.LT(1);
	match(INTTYPE);
	adaptor.addChild(a_root, INTTYPE_1);

	return a_root;
}

Tree rewrite rules

While most tree construction can be handled by the ^ and ^^ operators, sometimes you want to specify the structure explicitly in a declarative fashion rather than as the emergent behavior of the tree operations. The general syntax is:

r : alt1 -> rewrite1
  | alt2 -> rewrite2
  | alt3 // assume normal tree construction occurs
  ;

Sometimes you need semantic to dictate which rewrite to use. Semantic predicates work great here:

r : alt -> {pred1}? rewrite1
        -> {pred2}? rewrite2
        -> rewrite3 // default case
  ;

Simple no-loop alternatives

In alternatives without subrules, rewrite rules are trivial as they are simply reorderings etc...

a : A B C -> ^(A B C)
  | X Y   -> Y X
  ;

To generate trees from rewrite rules, just reverse the recognition process:

a() {
	Object a_root = null;

	if ( input.LA(1)==A ) {
		A_list.add(input.LT(1));
		match(A);
		B_list.add(input.LT(1));
		match(B);
		C_list.add(input.LT(1));
		match(C);

		a_root = adaptor.create(A_list.get(0)); // err if A_list.size()>1
		adaptor.addTokensAsChildren(a_root, B_list);		
		adaptor.addTokensAsChildren(a_root, C_list);
	}
	else {
		// first match
		X_list.add(input.LT(1));
		match(X);
		Y_list.add(input.LT(1));
		match(Y);

		// then build trees
		a_root = adaptor.nil();
		adaptor.addTokensAsChildren(a_root, B_list);		
		adaptor.addTokensAsChildren(a_root, C_list);
	}
}
Creating a list for single elements would be very expensive. Ok, on my 1Ghz laptop, I just ran this test for n=1,000,000:
for (int i=1; i<=n; i++) {
  List a = new ArrayList();
  a.add(String.valueOf(i)); // add one element
}

which took 1300ms versus more like 1400 or 1500 for LinkedList. When I made it add two elements, it jumps to 2400! 3 elements goes to 3700ms. So for a million tokens, which would be a big file, my crappy laptop can add every element to its own list in 1.3s. I'd call that fast enough. (JDK 1.4.2).

When the GC kicks in, though, all of these extra objects could drag it down.

Ok, how about labeled stuff?

r : a=A b=B -> $b $a
  ;

where $a refers to the last value matched just like in actions, though here there is only one value:

r() {
	Object r_root = null;

	a = input.LT(1);
	match(A);
	b = input.LT(1);
	match(B);

	r_root = adaptor.nil();
	adaptor.addChild(r_root, b);		
	adaptor.addChild(r_root, a);
}

The label turns off the usual "add this to a list" functionality.

Loops and repeated elements

When you refer to an element multiple times or a single element is placed in a loop, multiple values for a single name will be available for translation. And here comes the interesting twist beyond simple "replace this list of tokens with that list of tokens in that order" functionality.

The tokens are queued up on the left during parsing and are used as a token stream for "unparsing" / building trees on the right side of the -> rewrite operator. This is similar to StringTemplate building up attributes before actually generating anything. To build repeat substructures in your trees, you need to use a loop in the rewrite section. It consumes tokens from the elements mentioned inside:

r : A A -> ( A )* ;

results in:

r() {
	Object r_root = null;

	A_list.add(input.LT(1));
	match(A);
	B_list.add(input.LT(1));
	match(B);

	r_root = adaptor.nil();
	int n = A_list.size();
	for (int i=0; i<n; i++) { // ( A )* loop
		subrule_1_root = adaptor.nil(); // each iteration like rule exec

		adaptor.addChild(subrule_1_root, A_list.get(i));

		// end of subrule, add whatever was created to rule's root
		adaptor.addChild(a_root, subrule_1_root);
	}
}

Note that all multi-valued elements must have the same size or a runtime exception occurs.

// build ^(A B) ^(A B)
r : A A B B -> ( ^(A B) )* ;

results in:

r() {
	Object r_root = null;

	A_list.add(input.LT(1));
	match(A);
	A_list.add(input.LT(1));
	match(A);
	B_list.add(input.LT(1));
	match(B);
	B_list.add(input.LT(1));
	match(B);

	r_root = adaptor.nil();
	int n = A_list.size();
	if ( n!=B_list.size() ) { throw new Exception(); }
	for (int i=0; i<n; i++) { // ( A )* loop
		subrule_1_root = adaptor.create(A_list.get(i));		
		adaptor.addChild(subrule_1_root, B_list.get(i));

		// end of subrule, add whatever was created to rule's root
		adaptor.addChild(r_root, subrule_1_root);
	}
}

I think that only (...)* are allowed in the rewrite section. (...)+ wouldn't add much except a runtime exception that some element was missing. Perhaps later.

Note:

r : ID (COMMA ID)* -> ID ;

is the same as

r : ID (COMMA ID)* -> (ID)* ;

by the way.

What about labels on multiple elements?

r : ids+=ID (COMMA ids+=ID)* -> ID ;

It just replaces ID_list with ids I guess. If you want to separate the two ID references, use different labels

// build ID ^(COMMA ID) ^(COMMA ID) ...
r : first=ID (COMMA others+=ID)* -> $first ^(COMMA $others)* ;

Ok, let's try a fairly complicated example:

decl : "var" (ID ':' type ';')+ -> ^("var" ^(':' type ID)+) ;

generates:

Object decl() {
	Object decl_root = null;

	VAR_list.add(input.LT(1));
	match(VAR);
	while ( input.LA(1)==ID ) {
		ID_list.add(input.LT(1));
		match(ID);
		COLON_list.add(input.LT(1));
		match(COLON);
		Object type_1 = type();
		type_list.add(type_1);
		SEMI_list.add(input.LT(1)); // optimize out later (since no ref)
		match(SEMI);
	}

	decl_root = adaptor.nil();
	decl_root = adaptor.becomeRoot(VAR_list.get(0));
	int n = COLON_list.size();
	if ( n!=ID_list.size() ) { throw new Exception(); }
	if ( n!=type_list.size() ) { throw new Exception(); }
	for (int i=0; i<n; i++) { // ( A )* loop
		subrule_1_root = adaptor.create(COLON_list.get(i));	
		adaptor.addChild(subrule_1_root, type_list.get(i));
		adaptor.addChild(subrule_1_root, ID_list.get(i));

		// end of subrule, add whatever was created to rule's root
		adaptor.addChild(decl_root, subrule_1_root);
	}

	return decl_root;
}

Nodes From Imaginary Tokens

A very common situation when building trees: you want to add a node to the tree for which there is no corresponding input token such as the ARGS node in this rule:

args: ID (',' ID)* -> ^(ARGS (ID)+) ;

Actually most often you are actually converting a simple input token to a more meaningful node:

body : '{' (stat)* '}' -> ^(BLOCK (stat)* ) ;

After discussing with John Mitchell, I was convinced that you always want to create an imaginary node from a real token; that is, you want the line/column information from it for error messages. You might also need it for getting at the off-channel tokens like whitespace and comments. Hmm..how would you do the first example though with ARGS? Perhaps:

args: first=ID (',' others+=ID)* -> ^( ARGS[from=$first] $first $others ) ;

Actually the syntax could probably be ARGS[$first]. Then again, sometimes I want to reset the text and want to do this: ARGS["ARGS"]. I think John suggested to me that we name the args: ARGS[text="ARGS", line=3] or the more complicated: ARGS[from=$lc, text="ARGS"] where lc is a label on the '{'.

The "constructor" syntax would get converted to:

Object ARGS_node = adaptor.create(lc);
adaptor.setText(ARGS_node, "ARGS");

StringTemplate rewrites

Can I do a simple integration of StringTemplate for use with the rewrite rules? Seems like it as tree rewrites are just tree templates. The input is a list of token objects, which the StringTemplate could work with.

args: ID (',' ID)* -> "variables: <ID; separator=\",\">" ;

or

args: ID (',' ID)* -> <<
variables: <ID; separator=",">
>>
    ;

or

grammar T;
options {
  output=text; // output member {AST, text, tokens?}
}
...
args: ID (',' ID)* -> argList(ids=ID) ; // shorthand for "<argList(ids=ID)>"

which creates an instance of template argList and sets its ids attribute to be the list of IDs. The argList template would probably look like:

argList(ids) ::= <<
variables: <ids; separator=",">
>>

and would be in a .stg file with other templates used by the grammar. We would need to add a Parser.setTemplateGroup() method so you can dynamically change the target.

Wow. This should work out trivially:

StringTemplate args() {
	ID_list.add(input.LT(1));
	match(ID);
	while ( input.LA(1)==COMMA ) {
		COMMA_list.add(input.LT(1));
		match(COMMA);
		ID_list.add(input.LT(1));
		match(ID);
	}

	StringTemplate resultST = templates.getInstanceOf("argList");
	resultST.setAttribute("ids", ID_list);

	return resultST;
}

How about an inline template example

method : "method" ID body -> "public void <ID> <body>" ;
StringTemplate method() {
	METHOD_list.add(input.LT(1));
	match(METHOD);
	ID_list.add(input.LT(1));
	match(ID);
	body_1 = body();
	body_list.add(body_1);

	StringTemplate resultST =
		new StringTemplate(templates, "public void <ID> <body>");
	resultST.setAttribute("ID", ID_list);
	resultST.setAttribute("body", body);
	resultST.setAttribute("METHOD", METHOD_list); // unused

	return resultST;
}

If I build this in, it will need a predefined property: $ruleLabel.template$ or $.st$? I guess the more explicit is better.

Tree parsing

For parsing trees, you can ask the adaptor to get the type and text etc... of a tree. Consider rule

e : ^(PLUS INT r=INT)
  ;

I would generate something like:

e() {
	match(PLUS);
	match(DOWN);
	match(INT);
	UserSpecifiedTreeType r = (UserSpecifiedTreeType)input.LT(1);
	match(INT);
	match(UP);
}

where match(PLUS); is a method that essentially executes:

if ( adaptor.getType(input.LA(1))!=PLUS ) {
	// error: looking for PLUS, but found input input.LT(1)
}

Again, I'm not using the type of the tree node in any way...only the adaptor knows that (or the programmer in grammar actions).

You may note that I'll be using a simple input tree node stream that flattens a tree into a stream of nodes plus navigation (DOWN and UP) nodes. This means that the LL(*) analysis can be used w/o modification to look arbitrarily ahead in tree grammars to predict alternatives! Wooohoooo! Add to the adaptor interface:

/** For tree parsing, I need to know the token type of a node */
public int getType(Object t);

For error reporting, I'll just call TreeParser.reportError(offendingTreeNode) which will know how to decode the node and get file, line, column information out of it.

Tree rewriting in tree parsers

What about building trees in a tree parser? Should be similar to the rewrite stuff in parsers.

e : ^(PLUS a=INT b=INT) -> {$a.getValue()==0}? $b
                        -> ^(PLUS $a $b)
  ;

Here, $label should be of UserSpecifiedTreeNodeType type whereas in the parser you need to use $label.tree to get at the tree created for a token.

Heh, what if you want the same tree to go back out if the predicate fails? Perhaps strip away all false preds and if none is left, then just return the existing tree. Works just like a rule with no rewrites I guess.

Tokens as tree nodes and payload issues.

Another problem. Does ^(PLUS $a $b) create new nodes from $a and $b or does it reuse? Should reuse, but how can I avoid cycles? Somebody could make something a child of itself like ^($a $a) or something. That is why I like the payload idea. The problem is that the tree node could literally be the Token object created by the lexer. No reason not to do it that way; it would say thousands of token objects. I think Paul Lucas from BEA (at the time) suggested this at ANTLR2004 workshop.

January 30-31, 2005

AST Construction Proposal for 3.0

The ideas in this proposal are certainly not purely mine. Specifically, many of them come directly from Loring Craymer's excellent work on ANTLR 2.8e. My thoughts on trees have been greatly influenced by other research as well as Loring's including conversations with my other usual research partners such as Monty Zukowski, John Mitchell, Ric Klaren, and many others (antlr-interest list folks). I would like to thank everyone for their hard work and continued aid in my research. Hopefully I'm picking and choosing the right bits from everyone.

First let me mention that, while I have thought a lot about this proposal, I'm sure I am missing something. After my stream of consciousness below I tried out this proposal on the java.g grammar and the new ANTLR 3.0 antlr.g grammar. All trees were trivially constructed using the new spec w/o resorting to actions to build trees.

I'm looking for a solution that is really simple and effective for 98% of the problems rather than a complicated solution that covers 100%. I prefer something simple like Apple's iPod that does almost everything I would want really really well. I will leave target language actions to cover the 2% of the grammars that the proposed tree solution cannot handle.

I would also like to point out that Loring Craymer, the really smart guy who has extended ANTLR to have lots of cool rewrite actions for 2.8experimental, vehemently disagrees with this proposal. I think I'm fair when I say that he hates the rewrite rules because:

  1. they don't cover a lot of functionality that his solution nicely covers.
  2. rewrites are more verbose for things like adding imaginary/artificial tree nodes or reordering the output.
  3. maintenance issues--changing a rule can be dangerous as you may forget to change the rewrite rule.

Until I see an example that is common that rewrite rules cannot cover, I'm unconvinced we need the extra complexity. For point 2, I'm not worried about the verbosity as it's only for a few rules and the exemplar-based approach is very clear. Point 3 is not a large enough danger in my opinion to warrant using a more complicated approach. Finally, the rewrite rule approach allows me to integrate StringTemplate and token stream rewrites with the same idea, preserving conceptual integrity. :)

Anyway, I'm posting this proposal to get feedback from the ANTLR community. I'm very happy to throw this out if we find serious flaws!

Operators

The current operator-based mechanism in ANTLR (^ and ! operators) that I designed while at a boring supercomputing conference back in '92 has proven extremely useful--it is insanely terse and covers your needs for perhaps 90% of the rules. I'm stunned other tools haven't copied this. I will keep these operators, but will add another operator, ^^, to overcome a weakness.

Here are the operators:

  • x^^ make element x the root of the current rule's tree regardless of nesting depth in a subrule. This operator then operates as ANTLR 2.x's ^ operator.
  • x^ make element x the root of the current subrule. I will extend to allow x to be a rule reference that returns a single node; a run-time exception will be thrown for rules that return more than a single node.
  • x! do not include element x in the tree.

To explain the need for the new operator, consider the case where you have a rule that needs to make a tree composed of at least two subtrees such as

decl : "var" (ID ':' type ';')+ ;

You want the "var" to be the overall decl rule root, but you also want the ':' to be a root of the ID and type. E.g.,

     var
   /    \
  :      :
 / \\    / \
x  int  y  char

for input

var
  x:int;
  y:char;

With 2.x, you could try:

decl : "var"^ (ID ':'^ type ';'!)+ ;

But you'd get the following wacky tree instead:

     :
   /   \
   :    char
 /  \\ \
var int y
 |
 x

The ^ in the loop make ':' the root of the entire tree rather than of the production in the loop. Sometimes you want ^ to mean subtree for the rule and sometimes for the subrule, so I made two operators. ^^ makes an element the root of the entire tree almost like the dot-dot ".." parent directory in Unix shells. Unfortunately, ^ (the old operator) will mean subtree root for current subrule which is different than before. So, here is the 3.0 spec that builds the right tree:

decl : "var"^ (ID ':'^ type ';'!)+ ;

or equivalently:

decl : "var"^^ (ID ':'^ type ';'!)+ ;

Tree rewrite rules

When the operators are insufficient, such as when you want to reverse the order of some elements, a simple tree template that looks like a rewrite rule would suffice. For example,

r : ID INCR -> INCR ID ;

builds a flat tree with nodes out of order compared to the automated mechanism. Subtree rewrites are also possible of course:

r : ID INCR -> ^(INCR ID) ;

where ^(...) replaces the old #(...) notation. This builds a tree like this:

INCR
 |
ID

What happens when there are repeated elements in the grammar (which required an important observation)? Let's redo the decl example above without operators, using a rewrite production instead:

decl : "var" (ID ':' type ';')+ -> ^("var" ^(':' ID type)+) ;

which explicitly builds the tree using a tree grammar; recall that formal grammars can be used to specify both the generation and recognition of a language. Think of the rewrite rules as generational grammars.

The important observation is the notion of the input grammar to output grammar syntax/semantics. I just looked again at TXL, a source translation system. It also has a nice "this becomes that" (using Cordy's terminology) syntax. TXL goes parse tree to parse tree, however, and I'm going from tokens to AST or, as you'll see later, back to tokens. I'll do a more thorough comparison later when I do academic papers.

This is actually much easier to understand than the operator-based solution because you are providing an examplar or template that says precisly what will be built. Operators force you to imagine the emergent behavior of the rule.

How does this work? All recognition elements on the left are "remembered" by collecting them into lists and then you can refer to them in the tree template. For example,

args: ID (',' ID)* -> ^(ARGS (ID)+) ;

collects all the ID tokens in a list and then adds them as children of the artificial token node ARGS. Please note that the rewrite rule is exactly the tree grammar you would use to recognize the tree built by this parser. The tree looks like:

  ARGS
 /  |
ID1 ID2 ...

What if you want a list of ARG trees:

ARG  ARG  ...
 |    |
ID1  ID2 

Simple. Just pull the artificial token into the (...)+ loop of the rewrite rule:

args: ID (',' ID)* -> ^(ARGS ID)+ ;

What if you want to treat different references to the same element type differently? Just use a label on the elements. For example, here is how to build a tree from a text Lisp-like tree description:

tree : '(' root=expr (children=expr)* ')' -> ^( @root (@children)+ )
     | expr // "-> expr" is implied
     ;

where attributes are prefixed with @ even in rewrite rules for consistency. It also makes it clear that you may be referring to an attribute defined in a shared scope like @symbols.names rather than an input grammar element.

Referencing expr by itself means all trees matched for expr references throughout the production.

To be clear, referencing @children within an action of the input grammar still only means the result of the most recent reference to rule expr.

Imaginary Nodes

In general, referencing a token type such as ID is actually shorthand for ID[] if ID is not referenced in the grammar to the left of the ->, which would get translated into something like factory.create(ID). You might also want to set the text of the imaginary token with ID[x] or DECL["DECL"] so the tree printing routines will be able to print something useful.

What happens when you need to make an imaginary node that also happens to be a grammar element on the left? The following makes a flat tree with the parentheses doubled:

r : LPAREN expr RPAREN -> LPAREN[] LPAREN expr RPAREN RPAREN[] ;

So in this unusual case we need to say LPAREN[] to make it clear we are creating an imaginary node.

Setting the token type of a node

In the Java grammar for 2.x there are about 30 places where I do this:

block : lc:LCURLY^ {lc.setType(BLOCK);} (stat)* RCURLY ;

which is really ugly. Using operators may not be possible in this case if we want to remove the setType calls. Potentially, we could do this:

block : LCURLY^[BLOCK] (stat)* RCURLY ;

But then what if you want the text for that BLOCK to be "BLOCK"?

block : LCURLY^[BLOCK,"BLOCK"] (stat)* RCURLY ;

which starts to get muddy. In this case, I would suggest:

block : LCURLY (stat)* RCURLY -> ^(BLOCK (stat)*) ;

I think it is much more clear.

What if you want to create an imaginary token but use the line info and such from a real token that will not get put into the tree like '{' to BLOCK?

block : LCURLY (stat)* RCURLY -> ^(BLOCK[LCURLY] (stat)*) ;

Perhaps the factory and do the appropriate translation.

Actions in rewrite rules

Sometimes you'll want to be able to reference an arbitrary expression to add a node to a tree:

r : A B -> ^({special-root-expression} A B) ;

This makes a tree with the result of evaluating a target language expression as the payload for root and with A and B and children.

Rewrite specs in subrules

You can rewrite the productions of a subrule as well. Be aware that the -> rewrite operator turns off automated tree construction operators for everything in that production and below.

Reconsider our decl rule. We can use a subrule rewrite to get the appropriate tree:

decl : "var"^ (ID ':' type ';' -> ^(':' ID type) )+ ;

In this case, each iteration of the (...)+ loop creates a new subtree according to the rewrite rule and then that result is automatically added as a child of the "var" node.

Rewrite rules at an outer layer override rewrite rules below them because there is no way to refer to a partial tree unless I introduce a way to label a subrule, which is possible. So, currently, you'll get an error if you have nested -> rewrites.

Hmm...after rereading this, I think maybe I should disallow rewrite rules for anything but the outermost prodution. The goal is a simple and obvious "here is what the rule generates" and subrule rewrites again force you to imagine what the emergent behavior will be since it behaves like an operator and in conjuction with operators at any higher levels.

Ok, so a day later, I see the need for -> rewrites in subrules, but they set the subtree for the entire rule. Here is the reason:

field
    :   mods=modifiers
        (   ctorHead constructorBody // constructor
            -> ^(CTOR_DEF modifiers ctorHead constructorBody)

        |   classDefinition[@mods.ast]       // inner class
...
    ;

We need to set the resulting tree differently depending on the subrules. This may not always work when you have two subrules in a row like

r : S (A|B) (C|D) ;

There may be trouble specifying a different tree depending on syntax.

Integration with attributes

If you refer to an attribute, x.y, in a tree rewrite rule, it becomes the payload object of a tree node. For example, the following rule creates a single AST node whose payload is the text of the ID token:

r : id=ID -> @id.text ;

whereas

r : id=ID -> @id ;

or just plain

r : ID -> ID ;

gives you a single AST node whose payload is the ID token itself.

References to rule labels gives you the AST subtree(s) created for that rule reference or references just as a token label gives you the token(s) matched for that token; e.g.,

r : e=expr -> @e ;

gives you a flat list of expressions and is equivalent to:

r : expr -> expr ;

Rewrites dependent on semantics

Use semantic predicate notation to specify multiple rewrites according to various semantic conditions:

expr : left=mul_expr PLUS right=mul_expr
        -> {@right.type==INT && Integer.parseInt(@right.text)==0 &&
            @left.type==INT && Integer.parseInt(@left.text)==0}? ^() // empty
        -> {@right.type==INT && Integer.parseInt(@right.text)==0}? left
        -> {@left.type==INT && Integer.parseInt(@left.text)==0}? right
        -> ^(PLUS left right) // default case
     ;

Notes

There is no way to refer to the "current tree" or whatever since -> specifies what the tree is and all automatic construction is off.

There is no way to use rewrite rules to do the operator trees for expressions without referring to the "currently constructed" tree so the tree operators are mandatory for expressions.

By construction, there are no side-effects of these rules. They just set the return AST for the rule.

Implementation

AST nodes

AST nodes will be simple objects with a payload field and a list of children--no more child-sibling lists. Rather than have heterogeneous nodes to hold different bits of data per node, just point at a different payload object. The tree should literally just represent structure, you can put whatever you want in there. This is pretty cool because then there is exactly one definition of an AST node to do anything you want.

One of the problems with ANTLR 2.x trees is that you can create cycles where a child pointer actually points back up the tree somewhere to a parent. Trying to walk that tree even to print it out will result in an infinite loop. Once we move the user's payload outside of the tree, multiple nodes can point at the same payload--you don't have to point at the same tree node, thus, likely introducing a cycle.

I'm getting more and more into separating/encapsulating concerns. The AST tree node should show structure between payload objects. The user's application defines a single or multiple payload objects that the tree can refer too.

So, literally the Common AST node may be:

class CommonAST implements AST {
    Object payload;
    List children;
}

The most common payload will be a Token object. ASTs point into the token stream and tokens point into the char stream. We are creating more and more sophisticated views/perspectives on the char stream as we lex, parse, build trees, etc...

Heterogeneous nodes

I don't like heterogeneous tree nodes: you should separate the model from the controller; just good design. If you put your action methods in model by having different node types all overriding action(), for example, it is a "one trick pony". You can only make one pass over that tree and nobody else can use the tree really. Multiple passes are the name of the game which would force you to have your model entangled with multiple methods representing multiple tree passes. Further, your controller is then spread across many many files; bad encapsulation. Passes should be grouped/encapsulated by pass such as "build symbol table" not by tree node. You don't want the "DECL node" pieces of 5 translation phases all in one file. It's the wrong way to slice things.

You need to get your controller out of your model--either use a visitor or, more properly, a tree parser which is really just a sophisticated visitor that happens to verify the structure rather than doing a dumb depth-first-walk.

I like the idea behind SableCC's approach of using types to guarantee tree structure by construction, but I'm opposed to using the heterogeneous trees to have action() methods for performing translation. Besides, using a tree parser checks your structure as it walks. Also note that a tree grammar can tell you when your tree structure is ambiguous, whereas a bunch of types cannot do that!

So, in summary, I am fairly certain to drop heterogeneous tree support for 3.0. Naturally, using actions, you can still build these hetero tree if you want.

Rewrite rules

I like rewrite rules because it moves all tree construction for a rule into a single "postfix" action carried out after all information is collected by the parser. In the presence of errors, this may make tree construction much more predictable and manageable (since it's in one spot).

Rewrite rules should be encoded in code rather than using a template library like StringTemplate does for text. For example,

r : A B C -> ^(B A C) ;

would generate roughly (for Java):

public void r() {
  Token t1,t2,t3;
  scope_r_return retval = new r_return_scope();

  // match
  t1=LT(1); match(A);
  t2=LT(1); match(B);
  t3=LT(1); match(C);

  // build AST
  AST t = null;
  t = factory.create(t2);
  t.addChild(t1);
  t.addChild(t3);
  retval.tree = t;
  return retval;
}

For repeatedly-matched elements, I'll have to build a list:

r : VAR (ID)+ -> ^(VAR (ID)+);
public void r() {
  Token t1,t2;
  List IDs = new ArrayList();
  scope_r_return retval = new r_return_scope();

  // match
  t1=LT(1); match();
  while ( LA(1)==ID ) {
    t2=LT(1); match(ID);
    IDs.add(t2);
  }

  // build AST
  AST t = null;
  t = factory.create(t1);
  for each id in ID {  // pseudo-code
    t.addChild(id);
  }
  retval.tree = t;
  return retval;
}

Naturally, I have no idea how to do all of this yet, but it shouldn't be too hard. It looks like the inverse of building a parser. I like having the two input and output concerns separated instead of entangled.

Future thoughts

I am looking at string templates and token streams as rewrites also. For example, there is no reason why a parser cannot act as a token stream filter by reading a token stream and spitting out a modified one. A token stream rewrite is just a flat tree rewrite. E.g.,

args: ID (',' ID)* -> (ID)+ ;

spits out the ID tokens back out without the comma.

This may be a more sophisticated version of the TokenStreamRewriteEngine object in ANTLR 2.x because you can insert text and such easily:

args: ID (',' ID)* -> "[" (ID)+ "]" ;

Nice, easy translation based upon syntax...back to text preserving all the whitespace and other hidden tokens in any rule that doesn't do a rewrite. I dig it.

Core dump

This is stream of consciousness when I was thinking about trees; may give you an idea of how I derived everything.

headerSpec
    :   (   "header"^ (id)? ACTION => ^( "header" id ACTION )
        )*
    ;

plus_expr : mult_expr (PLUS^ mult_expr)* ;

same as

plus_expr : mult_expr (PLUS mult_expr ^{ ^(PLUS plusexpr mult_expr) } )* ;
^{ tree construction action }

^(PLUS plus_expr mult_expr) means "make a tree with old tree as left child". There is an implied:

plus_expr = ^(PLUS plus_expr mult_expr);

so repeated invocations will be taller and taller tree?

Ok, operators ^ and ! are really useful, but let's try using =>.

optionsSpec
    :   OPTIONS (option SEMI)+ RCURLY => ^(OPTIONS option)
    ;

where option will be set multiple times by the loop? Hmmm...I might prefer:

optionsSpec
    :   OPTIONS (option SEMI)+ RCURLY => ^(OPTIONS <option>)
    ;

to make it clear it's a template to fill in. What if we wanted an OPTION node on top?

optionsSpec
    :   OPTIONS (option SEMI)+ RCURLY => ^(OPTIONS <option:{o|^(OPTION <o>)}>)
    ;

Boy, that would freak people out.

Now here's another rule with a list of stuff and you want OPTIONS on top:

optionList
    :   option (COMMA! option)* => ^( OPTIONS <option> )
    ;

I'm coming to the conclusion that the references in the ^(...) templates should be in a totally different namespace that the references in the grammar. Too confusing. Holes should be <...> I think and others like OPTIONS are literals that create nodes. OTOH, the following is very convenient:

^(PLUS <previous> <right>)

where you want the actual PLUS token found as the tree node payload since it has file/line info. Seems that naked PLUS has to be a literal: an imaginary node in the tree case. To reference the actual PLUS token, have to set it like anything else? Hm...wait. What if anything in <...> is a reference to a label or token or rule reference in the grammar. If it's not the same name then it's a new attribute you can set.

^(<PLUS> <plus_expr> <right>)

So for rule:

optionList
    :   option (COMMA option)* => ^( OPTIONS <option> )
    ;

you'd get an error:"ambiguous reference to rule result 'option' in tree ctor".

What about

optionList
    :   @o=option (COMMA @o=option)* => ^( OPTIONS <o> )
    ;

This means that o is set multiple times to the result of calling option. Uh oh. Is it the tree or the text or what of a rule reference? Ah. o=option would be different as it sets o to the attribute scope of option. Since @o is a tree attribute, you get the tree attribute o.tree. nah...@o should only be in actions.

optionList
    :   o=option (COMMA o=option)* => ^( OPTIONS @o ) // or @o.ast???
    ;

yeah, but then it's different notation than StringTemplate!

optionList
    :   o=option (COMMA o=option)* => ^( OPTIONS <o> )
    ;

Heh, let's look at @o again:

optionsSpec
    :   OPTIONS (option SEMI)+ RCURLY => ^(@OPTIONS @option:{^(OPTION @it)})
    ;

why be different than ST though? We'll want => "..." too.

Actions are @x.y rather than <x.y> because that is unique. What if we made it @<x.y>. Nah. These are not templates. Actions are @x.y.

Templates are <...>. If you want to refer to an attribute

optionsSpec
    :   OPTIONS (options=option SEMI)+ RCURLY
        => ^(OPTIONS <@options:{o|^(OPTION <o>)}>)
    ;

So @options refers to the grammar attribute outside the template. A simple template ref of <options> would not find any value. Damn! Very complicated.

Try action approach:

optionsSpec
    :   OPTIONS
            (options=option SEMI! ^{ ^(OPTION <option>) } )+ // how to implement?
        RCURLY!
        ^{ ^( OPTIONS <previous> ) }
    ;

or

optionsSpec
    :   OPTIONS
            options=( option SEMI => ^(OPTION <option>) )+
        RCURLY
        => ^( <OPTIONS> <options> )
    ;

where "options" is the result of the subrule and ^(...) outside of action just adds a child to current tree. Oooh. That's like Loring's stuff i think.

What about where you want to use ^ cause it's easy, but can't since you have two root nodes?

decl : "var" (ID ':' type)+ ;

decl : "var"^ (ID ':'^ type)+ ; // want ^ isolated to subrule

Try this:

decl : "var"^ list:(ID ':'^ type)+ ; // labeled subrule is a rule

So list:(ID ':'^ type)+ is a subrule or embedded rule. Has => overall result or ^{...} actions, which loop and execute multiple times.

Too subtle with either list= or list:. Booo!

decl : "var"^ (ID ':' type => ^(':' ID type) | foo)+ ;

Well, I guess the shorthand is just for when it really just works in an obvious way. Can mix I suppose:

decl : "var"^ (ID ':' type => ^(':' ID type) | foo)+ ;

But, can easily build via:

decl : "var" (ID ':' type => ^(':' ID type) | foo)+
       => ^("var" decl)
     ;

where decl refers to tree already built? Doesn't work. No tree built! Have to label everything:

decl : "var" v:(ID ':' type => ^(':' ID type) | foo)+
       => ^("var" @v)
     ;

So, ID and type refer to the tree for them but @v is the attribute (implicitly @v.ast) created for the loop? Boy, I sure don't like adding that @v in there. Would plain v work? That might be better. @ is only to highlight in an action.

decl : "var" v:(ID ':' type => ^(':' ID type) | foo)+
       => ^("var" v)
     ;

Here "foo" ref is automatically created tree as it has no => spec.

What about changing the token type? Hmm...make a new literal node

charSet
    :   LPAREN
            first=charSetElement
            others=( OR charSetElement => ^(OR charSetElement) )*
        RPAREN
	=> ^(CHARSET first others)
    ;

note that the charSetElement is considered unique as it picks the closest ref inside the (...)* scope. Can override this with a label.

But, how do you get line information? Set manually? Ugh, but what else to do?

Actually perhaps a new operator

plus_expr : mult_expr (PLUS^^ mult_expr)* ;

^ defaults to your scope but ^^ means affect the overall tree! Heh, that is awesome!!!! So it means more typing for each expression rule, but many ^ in a loop want the local scope. Perhaps make ^ the overall tree root and another symbol the "local root". ^^?

charSet
    :   LPAREN^ {@LPAREN.setType(CHARSET);}
            charSetElement ( OR^^ charSetElement )*
        RPAREN!
    ;

Seems like ^ should be local tree and ^^ should be overall, but that probably means more overall typing.

How to add imaginary node. Currently:

altList
    :   a1:alternative ( OR! a2:alternative )*
        {
        #altList = #(#[BLOCK,"BLOCK"],#altList,#[EOB,"<end-of-block>"]);
        }
    ;

perhaps

altList
    :   a1:alternative ( OR! a2:alternative )* => #(BLOCK altList EOB)
    ;

but => turns off auto construction. :( Ah. Wrap in a subrule?

altList
    :   foo=(a1:alternative ( OR! a2:alternative )*) => #(BLOCK foo EOB)
    ;

Ugly...

How about:

altList
    :   a1:alternative ( OR! a2:alternative )* => #(BLOCK alternative+ EOB)
    ;

alt :	(element)+	=> ^(ALT element+ EOA)
    |			=> ^(ALT EPSILON EOA)
    ;

where element+ means to refer to all values of element?

For decls,

def
	scope {AST decl;}
	:	m=modifiers t=type varDefs
		=> ^(VARDEF ^(MODIFIERS modifiers*) ^(TYPE type) decl )+
	;

varDefs!
	: 	@decl=declarator (',' @decl=declarator)* // set via scope
	;

declarator
	:	ID init {@decl = @declarator.ast;} // or can set here
	;

actually don't want any actions messing with trees! Always in => or with operators. For

public int t=3;

we want:

#(VARDEF #(MODIFIERS static) #(TYPE int) ID #(ASSIGN INT) )

Works great. So any tree ^(...)+ with a + or * on it is duplicated n times where repeated grammar elements are matched n times. all multi-valued elements must be same n size. Multi-valued elements must be grammar elements matched multiple times or an attribute set multiple times (e.g., the @decl case above). Repeated elements are considered single-valued elements to any enclosing ^(...)+ loop.

r : a B (c D)* => ^( B a ^(D c))+ ;

repeats entire ^(B ...) tree for each value of c and D matched. If no values matched then the tree is ^(B a).

r : a B (c D)* => ^( B a ^(D c)* ) ;

Here, only the subtree is generated multiple times for each c and D. Exactly like a tree grammar.

So, we specify a parser grammar and then the corresponding tree grammar production to match the resulting trees. Beautiful! Skip the middle man and automatically produce the tree grammar w/o having to generate it. Might still use/need Loring's tree grammar generator for operator-based generation.

How will we handle references to labels to differentiate which element refs? For example, how to build a tree from a text tree description?

tree : LPAREN root=expr (children=expr)* RPAREN => ^( root children+ )
     | expr // "=> expr" is implied
     ;

No need to use @root or @children as meaning is clear. A ref to an element means just the last value. A ref to element+ means ALL values created for it. If you say just element not element+ and there are multiple values, then you get a runtime tree construction error.

Should there be both * and + ???

tree : LPAREN expr (children=expr)* RPAREN => ^( expr children+ )  // WRONG!!!!
     | expr
     ;

A reference to expr gets all values even if some are labeled and some are not as in the above. In this case, the expr ref in the => section would produce as root the last child.

I like this => section as it happens at a very specific time: after all elements in that rule have been matched and it doesn't muck up the recognition part.

Hmm...for recursion such as:

r : A r => ^(A r)
  | B
  ;

what does r refer too? It should be the referenced element, but how do you refer to the current tree for a rule? We need this because there can be many subrules with => specs and they may want to set the root. May not be able to.

How to build expression trees with => rewrite specs?

expr : left=mul_expr (PLUS right=mul_expr => ^(PLUS left right) )*
     ;

That currently only sets the root of the (...)* block and hence would be a sibling of left. Also, how do you reset the value of left so you can build a taller and taller tree?

Ah, what if ^^(...) means set the overall tree and ^(...) sets the local? (or reverse because make overall root is the most common thing!)

expr : left=mul_expr (PLUS right=mul_expr => ^^(PLUS left right) )*
     ;

After one iteration, that would make the proper ^(PLUS left right) tree for the whole rule. Section iteration would lose the previous iteration's PLUS-rooted tree, however.

What if we allowed assignment in the spec:

expr : left=mul_expr (PLUS right=mul_expr => left=^^(PLUS left right) )*
     ;

Blows side-effect-free characteristic, but perhaps it's ok.

What about @expr.tree as an attribute

expr : mul_expr (PLUS right=mul_expr => ^^(PLUS @expr.tree right) )*
     ;

The => is scoped in the (...)* so the first mul_expr becomes @expr.tree right away and the => implicitly assigns to it. Wow. I dig it! Still, the side-effects make me really nervous.

Goal: be able to build most trees outside of actions so we can interpret. :)

How to handle different trees depending on actions?

expr : left=mul_expr PLUS right=mul_expr
          => {@right.type==INT && Integer.parseInt(@right.text)==0 &&
              @left.type==INT && Integer.parseInt(@left.text)==0}? ^() // empty
          => {@right.type==INT && Integer.parseInt(@right.text)==0}? left
          => {@left.type==INT && Integer.parseInt(@left.text)==0}? right
          => ^(PLUS left right) // default case
     ;

Nice and neat, but noninterpretable; oh well. In interp mode, just pick the one w/o a pred I guess.

Later for text templates, we should be able to do this:

expr : left=mul_expr PLUS right=mul_expr
          => {@right.type==INT && Integer.parseInt(@right.text)==0 &&
              @left.type==INT && Integer.parseInt(@left.text)==0}? "" // empty
          => {@right.type==INT && Integer.parseInt(@right.text)==0}? "<left>"
          => {@left.type==INT && Integer.parseInt(@left.text)==0}? "<right>"
          => "<left> + <right>" // default case
     ;

Cool.

For lists of stuff can do

decl : type ID (',' ID)* ';' => "<type> <ID; separator=\",\">"
     ;

elements in the grammar become incoming attributes to the "unparser" template. When you reference a template in a stg file, it would be:

decl : type ID (',' ID)* ';' => decl(type=@type,ids=@ID)
     ;

Hmm...a little weird when the arg and element are same name (e.g., "ID=ID") so I added the @ to @type to make it clear what is going on. A little strange as I don't normally require @ outside of user actions. Ack! Oh, maybe:

decl : type ID (',' ID)* ';' => "<decl(type=@type,ids=@ID)>"
     ;

which is totally clear; wait, same problem. I use just <type> above to reference incoming list of type values and <ID> to mean all the IDs. See what the ANTLR v3.0 code generator would look like using this notation. Probably lots of predicates.

@ID doesn't make sense as ID is a grammar element not an attribute or label. :( Will just have to accept:

decl : type ID (',' ID)* ';' => "<decl(type=type,ids=ID)>"
     ;

Perhaps it automatically pulls args from the list of matched elements if they match with the template def says. Means I need to load the template group in ANTLR. Try to reduce the jar dependency though through dynamic loading.

OK, it should be @foo for any reference to a label or attribute or whatever even if inside ^(...) so it's clear always what is a grammar element and what is a label/attribute. For example, you might want trees created from values not trees coming out of exprs.

duh : left=expr PLUS right=expr => ^( PLUS @left.value @right.value )
    ;

expr returns [int value]
    : INT {@value = Integer.parseInt(@INT.text);}
    ;
SIDEBAR

What if we could unparse a tree using the original input grammar?

For example if we build

DECL
 |
type
 |
ID -- ID -- ID

for the above grammar, could we use the input parser grammar to unparse? Perhaps we search the tree for type and then dump first. Then look for first ID and dump. Hmm...probably could get something to work, but easier to have it just remember input token positions and dump back. Either want to just spit back the input or are translating to something else, in which case, you don't care about having to give the output rules as those usually specify the transformations--you have to do it anyway.

BTW, tree grammars will look like:

expr : ^(PLUS expr expr) 
     | ^(MINUS expr expr)
     | atom
     ;

November 29, 2004

Random thoughts:

  • a reverse index of node types / text to tree nodes could be useful for a quick pattern matcher. Might be able to track what types are below which nodes also for fast search.
  • Currently, the sibling pointer really can screw up your trees if you try to make a subtree part of more than one tree. For example, when working on the 3.0 prototype I had a tree, b, returned from a rule. In certain cases, I wanted to wrap b in a bigger tree:
    #(BLOCK b EOB)
    
    the problem is that to make that tree, ANTLR sets the next-sibling pointer of b to point to EOB. Later, when I try to use b on its own, I get tree b EOB instead of plain b. Ugh. So, when using a list of children instead of child-sibling pointers, we would not have a problem with this. The BLOCK node would have two children, leaving what is under b alone. What about up/parent pointers? A subtree can only have one parent pointer. See next item.
  • Reference nodes. Is there a need to have an AST node type that is just a reference to another node? Does this solve the multiple parent problem? Perhaps not since when you are at the bottom of the subtree, you will not know who to go back to, the real parent or the parent of the reference node. Ref nodes useful for anything else?

November 10, 2004

After Loring bitched at me on the phone yesterday <snicker>, I'm beginning to think he's right: if separating the node data from the navigation is the right concept, then trees should be a single implementation that simply carry a "payload" object defined by the user. This is like Sun's MutableTreeNode.

The interesting thing is that the tree nodes themselves are all homogeneous, but the payload object could be totally different at each node. So, the parser creates exactly one kind of node with a variety of payload (or a single) types. The payload can be defined in the grammar itself so that ANTLR can generate the required type(s).

grammar P;

AST {
	Token token; // probably a default field
	String blort;
	...
}

Then ANTLR would generate P_AST type for use when building trees. For heterogeneous trees, we currently allow type names in the tokens section:

tokens {
	ID<AST=IDASTNode>; //something like that
	...
}

If we allow field specifications, that is better than specifying a type name maybe:

tokens {
	ID<String name, int n>;
	...
}

Whatever. A random idea.

As a bonus to the payload strategy, we can enhance the tree functionality later w/o forcing alterations in people's application; their payload objects still fit in our nodes.

This all comes at the cost of an additional object creation (payload creation + node creation).

Side note: Mitchell suggested Tokens and Tree nodes should have not only fixed fields like this, but the ability to dynamically acquire "attributes"; this would basically be a hashmap. It cuts down on a bazillion subclasses. It would be useful when parsing xml for example. The TAG token could have a list of tag attributes w/o creating subclasses etc...

November 9, 2004

I originally chose child-sibling trees because they are very efficient in terms of memory, but they have a "different" structure than the way we normally draw trees. For example tree #(PLUS 3 4) is drawn like this:

   PLUS
   /  \
  3    4

but stored like this:

PLUS
 |
 3 -- 4

I had the AST structure:

class AST {
  AST firstChild;
  AST nextSibling;
}

But there is a huge problem with this: a tree pointer not only points at its subtree, but points also at it's next sibling when it's in a list of children. Further, the previous child points at it. Essentially, the data at a node is mixed up with the structural information. Moving a node is complicated and we often end up with unintended cycles and other mixed up structures. The data structure also does not make it easy to doubly-link it, which is very useful during translation.

For 3.0: I am going to alter the basic concept here. Each node will have a list of children the list will be independent of the nodes such as an array of pointers. Copying a reference to a tree is easy, just do the pointer copy--you don't have to worry about previous or next pointers. Each node will have a link to its parent. Given a random tree pointer, however, it will be a linear search to find the next sibling. You have do this:

public Tree getNextSibling() {
	Tree p = getParent();
	int i = p.indexOfChild(this);
	return p.getChild(i);
}

Now that does not say we cannot have multiple interfaces over this basic data structure. I'm just saying that the implementation will make it much easier to do tree manipulations. I passed this by John Mitchell and he seemed to agree; so I'm sane within the bounds of John's sanity. ;) Perhaps Tree looks like:

public interface Tree {
    /** I am a child of what node? */
    public Tree getParent();

    /** Get a child 0..n-1 */
    public Tree getChild(int childIndex);

    public Tree getPreviousSibling();

    public Tree getNextSibling();

    public Tree getFirstChild();

    // should this be part of interface?
    public int indexOfChild(Tree t); 
}

The immediately obvious problem with this interface (and ANTLR 2's AST interface) is that it collides with DOM trees--they have the same method names, but DOM as a return type not Tree. I suppose I can just use different names so a tree implementation could "implements" both DOM and ANTLR's Tree. In Old English, "sibling" was "sibb". What about the dreaded "brother" or how about "kindred"? "kin"? "kinsman"? "Neighbor" (good but icky to spell)? Thesaurus list nothing else interesting. How about getNext(), getPrev(), and getChild()? Oh, how about the more direct: getUp, getDown, getLeft, getRight? Yeah, that works. :) How about this?

package org.antlr.runtime.tree;

/** What does a generic tree look like?  AST and parse trees extend
 *  this interface.  This is called Tree instead of node or whatever
 *  because node is just a single-node tree.
 *
 *  This is the minimal definition of a tree usable by all ANTLR
 *  concepts such as parse tree and AST building plus tree parsing;
 *  technically we don't even need the parent, but that can just
 *  return null if your implementation doesn't need it.
 */
public interface Tree {
    /** I am a child of what node? */
    public Tree getParent();

    /** Get a child 0..n-1 */
    public Tree getChild(int childIndex);

    public Tree getDown();

    public Tree getRight();

    public Tree getLeft();

    /** Synonymous with getParent(); do we need this? */
    public Tree getUp();

    // should this be part of interface?  Perhaps it's just part
    // of impl.
    public int indexOfChild(Tree t);
}

Question: should null be allowed as a valid child list item? It would be a convenient means of doing a delete. I think it should not be allowed because then getNextSibling() won't work. If you get null back you couldn't move to the next sibling (how do you search for null if there are multiple?). A further restriction is that you should not add the same tree more than once if you want the getNextSibling() stuff to work. This would be too expensive to enforce probably.

November 30, 2003

Transformations with array of ast down/right pointers instead of just one. To copy a tree node from pass i to j is just right[j] = right[i] etc... Adding, deleting nodes from phase to phase is ok as links are in separate planes so won't affect previous planes. You don't have to wipe out old pointers to move a node into new phases as old pointers are not seen. Dramatic reduction in number of node allocations as node n is reused from pass i to pass j if it is not altered.

Do we have to find a way to force setXXX() methods on trees to dup before write so that previous phases aren't affected? Nah, don't really care about history unless you want to roll back.

Token stream is perhaps at phase / plane 0, a flat tree of the input tokens. Might be useful for unparsing to text (in target language) :)