|
Home |
Download |
ANTLRWorks |
Wiki |
About ANTLR |
Feedback |
Support |
Bugs |
v2
|
|
|
Latest version is 3.0.1 Download now! » |
|
|
ASTs, Parse Trees, etc...
April 20, 2007Hmm... ^(PARMS parms+)? doesn't work. There are no real elements in the (...)? So, unlike loops then, the optional subrule has to check deeply for referenced elements. Alter define.g so that it sets a currentOptionalBlock not just current block. Add all elements to it? April 19, 2007Ok, got a problem. For multiple IDs in the following grammar, you
should get multiple copies of both type and modifier result
trees.
Should get trees like:
Currently I get
Once the modifier is consumed in the modifier? it is not visible the
next time around the loop:
The condition stream_modifier.hasNext() is false after first execution of the optional subrule. One way to think about this is to reset the model fire stream after the optional subrule, but this makes the while conditionthink there are an infinite number of modifier trees in the stream_modifier. Certainly it seems like a good idea to reset the streams after a
subrule so that one can repeat a subrule and get a duplicate of
everything inside; e.g.,
This should create two duplicate lists of trees, but currently you will get a cardinality exception because type and ID will run out of elements in their streams after the first subrule. It seems clear that the main issue is nested subrules. I am using a condition for the outer subrule that tests how many elements are left in each stream referenced at any depth in the subrule. Any streams that get reset in a nested subrule cause an infinite outer loop. We need to be able to reset streams for nested subrules without affecting the condition of the enclosing subrule(s). Ok, let's think about what ^(type modifier? ID)+ means; that is, how
many times does that closure loop go around? The answer is that it
goes around as long as there are elements left to emit. The only
problem comes in when we need to duplicate. Of course, we only
duplicate when the element is a size 1, which may lead me to a
solution here. Let's consider the case where all streams have multiple
values:
In this case, there can be no duplication. We do not want to reset
stream_INT after the optional subrule because we need to keep pulling
from the stream not duplicate the first element. If there is a
cardinality difference, it is an error. Now, let's consider the case
where all elements have cardinality 1:
Loops can only go around once if all elements have the same cardinality and that carnality is 1. In this case, the optional subrule should not reset stream_INT because we want to exit the outer loop. There are no more ID nor INT elements to iterate on. The loop should enter when either stream has a value. The problem then seems to be only when we have a mismatch between
multi-valued streams and single valued streams:
Wait. Perhaps the answer is to avoid testing the streams for nested
subrules. For example, the following is weird because there are no
elements that the loop knows how to test: expr is in the subrule and
the outer subrule does not know when to keep looping.
Now, with the following semantically-equivalent spec, we get one that
works:
You will get a series of EXPR-rooted trees with single expr
children. The loop will test for the only non-imaginary element in the
loop: expr. Turning back to our original problem:
it seems that we should test only type and ID streams:
Ha! I think that's the trick! April 14, 2007With ref to labels, I think that we need to create streams for
single-valued labels and list labels. Good, looks like it already
tracks elements in two lists; e.g., ids+=ID (',' ids+=ID)* yields (partially):
Ok, so we just have to convert List to RewriteRuleTokenStream or whatever. Ack! These are accessed from actions on left side of -> too. I guess we have to create a stream at start of -> that feeds off that list. For single-valued labels such as t=type, we need to create:
and then use list_type.next() not t.tree in the rewrite section. We have referencedTokenLabels etc..., computed by codegen.g PER block. I think we need an overall list and then I can just create the stream objects at start of -> code section Crap: we also need to know whether the label is a token label, rule label, or list version of that. Ok, I can compute that. Ack. What about multiple -> in subrules for a single rule? How does that affect list of rewrite elements? Track per rewrite? or globally per alt? Ack. What about multiple -> separated by predicates? Hmm...have to prepare for any of them to execute. Treat them as one rewrite in terms of labels and stream set up.
April 13, 2007ooops. What about labels? Seems like they should go into a stream too so the code is uniform. Wait. Labels simply point at the last Token or subtree tracked for a
rule element.
This will yield the following defs:
and then
Ok. Referencing $x and $y means what? How does $x differ from A in
the rewrite rule in this case? Hm...well, let's look at what we do
currently. The rewrite code gen does this:
This seems ok, but what about duplication?
This collects elements like this:
We need to duplicate type's return tree for every iteration of the
rewrite loop. No copying at moment:
Instead of t.tree, it should use list_type.next() I believe. Wait, no
that doesn't let me distinguish between first and following elements
needed for:
So, $main should literally be main.tree but how do we duplicate it if it's in a loop like $t above? April 12, 2007Ok, I think I have a new understanding or perspective on rewrite rules. We should imagine the various lists of tokens as separate input streams that the rewrite rule can draw from. We do not need to check the cardinality upfront--we can simply keep pulling from the list until we run out, at which point we can throw something analogous to a mismatched token exception. Avoiding the complex cardinality checks is a big win. This mechanism also makes it easy to handle node duplication. If you're stream size is 1 and you are asked to produce more than one node, simply duplicate the first element. We will need to wrap each token reference in a method call or use a special List.That way, I can also avoid having to use indexes. Each list will act like a token stream that contains a cursor. next() can pull from the stream instead of get(i). Let's look at a few examples:
That looks great. Hmm...how do we know whether to enter a subrule? If it is a single element or all simple elements (no nested subrules), it is easy:
If for some reason there are was not an A or B, next() would throw an exception. That seems ok. Now, what do we do about subrules with random elements such as other
subrules?
How do we know whether to enter the optional subrule? To answer that question, we have to know whether the subrule will generate a tree. Knowing that requires recursively examining each element. Hmm.. I guess that is it: we need to simply recurs down to find all token and rule references mentioned within a subrule. If any of those have elements last, we enter the subrule. If there is a mismatch between cardinality, we will find it after we enter. Here's a weird extreme case:
The entry expression for the outermost subrule is: if (A or B or C or D have more elements) ... The first nested subrule would test the same list. Basically, the rule is probably: if there is a reference to an element that still has a token left to add to the tree, we should enter any surrounding subrule. The (A B)+ subrule is entered if either A or B has a token. The remaining subrules are obvious. How do we deal with checking cardinality for 0 versus 1? For example, (A B)? should give an error if:
The subrule is entered if either A or B have tokens as a general rule
regardless of the type of subrule--condition 1 could happen. In that
case, one of the next two statements will fail:
I guess that is what we want...the subrule would not have been entered if neither A nor B had tokens. Now, what about loops like (A B)+? The loop will enter and continue
as long as either A or B have elements. To be valid, |A|=|B| must hold
or |A|>1 and |B|=1 or vice versa. In either case, at least one of the
lists will have the token so the loop will enter. Let's consider the
case where |A|=0 and |B|=1 to see if we catch it:
Ok, that works and so would |A|=1 and |B|=0. Let's try the case where
both are n=|A|=|B|. That would work also because each iteration
would pull one value from the lists. What about the case where |A|=1
and |B|>1, say, |B|=2? We may have a problem knowing when to "smear" the As
(duplicate for each B).
What about repeated references to the same element? A -> A A should
copy, clearly. What about A A -> A A? Seems like that should
simply have two output creation sites that draw from the same list:
So what about A+ -> A A*? This would go against what the book says
I think, but makes sense in this new perspective:
In the book, I say that if you want to treat the first of several
elements differently, you need to use labels; But, I'm not sure that
is true. For example, I said you had to do this:
but, why not do this:
This way we can look at the term method as simply a source of
subtrees and we use them up as the references go along. This does not
give you very much flexibility though in terms of the order. For
example the following does not work in that you never get a proper
MAIN subtree (there will not be a method child):
I think one of the things that I'm trying to do is remove the ordering problem. I guess this is a rare case and then you could still use the label solution. Most of the time, people will only have one reference. If there is more than one reference, let's imagine that it consumes from left to right lexically. Ok, more copying concerns. How do we know when to copy when we are not
in the loop? What code is generated for A -> A A? Perhaps this:
What about A A -> A A A, though?
It seems that we should cause an error rather than duplicate. We do not want to have a rule where we smear the last element out because otherwise that would screw up loops that had elements with mismatched cardinality. Ok, seems like the rule should be: next() duplicates if |list|=1 and already emitted list.get(0) Let's check some duplication related to loops:
That should generate something like the following:
What about the following? How long should the loop go around
The loop will enter the first time because 'b' has one element but
should only go around once once we've consumed it.
Ok, this all seems very consistent to me. Hooray! Let's try to formalize this:
The only pain is to recursively find the list of referenced elements. April 11, 2007Ok, the AST rewrite cardinality checks and node duplication has some
bugs. I need to rethink to be more consistent and correct. One
issue is that the current code generation assumes that all elements
within a loop at the same cardinality, but this is not always true:
In this case, we need to duplicate 'int' for every ID found in the
input string. Input
should create the following tree:
Further, its cardinality is 1, which is not equal to the cardinality
of the ID list in this case. The same is true for user rule:
We need to duplicate the tree return from type each iteration through
the tree construction loop in the rewrite. Previously I had only done
the duplication when the rule was mentioned twice as in:
My goal in this entry is to brainstorm about how each EBNF rewrite construct should be translated to code. Let's take some examples to see if we can extract what to do, given what ANTLR does now. Let's starting with the optional EBNF construct: (...)? Zero-or-one blocks
(...)* Zero-or-more blocks
Ok, now I'm remembering... If everything inside the loop has cardinality 1, then loop goes around
n=1 times. If at least one element has cardinality > 1, then n =
cardinality of that element. All elements with cardinality > 1 must
have exactly the same cardinality. What about cardinalities 1 and 0,
as in:
That will fail if |B|=0. What would the rewrite produce? So, let's add a rule: If all elements are <= 1, then all must same cardinality. Ok, so cardinalities can be:
Ok, I guess we can say: all cardinalities must be same, n, 'cept that for blocks with all non-zero cardinalities can be n or 1. Hmm...what about non-elements like nested subrules? Nested subrules
June 20, 2006Currently we cannot do tree->tree transforms using tree grammars. If you try, you can see that the basic mechanism borrowed straight from parser -> AST almost works. My concern is how to deal with sharing tree nodes. Clearly if you only want to go through and replace x*0 with 0 everywhere in expressions you should be able to reuse the rest of the tree; i.e., without dup'ing it. I think the way to deal with that is that if you don't create a tree, return the incoming tree pointer as the result tree. Bernhard Damberger: Two kinds of lines of code are causing problems: adaptor.setTokenBoundaries(retval.tree,retval.start,retval.stop); In this case retval.start and retval.stop are Objects, not Tokens. Yes, that would need to be turned off when building trees from a tree; those values are set and should not be touched. And
on the third line because string_literal4 is expected to be a token, but its a Object. We could add that method create(object) to the adaptor no problem. The issue is that we'd be dup'ing nodes like crazy. I think tree grammars should by default just return the input tree.
If you have -> rewrites then a new subtree for a rule is created. ^
and ^^ would not be allowed and I'd like to disallow ! also to be
consistent. ! could be useful but mostly likely you would want to
remvoe a node upon some condition, hence, just saying ID! is not that
useful. If you unconditionally knew you didn't want ID in the tree,
you could have used ID! when creating the tree. Actually, you could
also reuse old subtrees too:
So every rule would by default terminate with a hidden "-> $rule" Perhaps you'd also be interested in inserting nodes anywhere in the tree when you see, for example, an implicit var def (x=3). You'd want to insert a def for x somewhere above. You could say engine.insertAfter($block.getFirstChild(), ^(DECL, TYPE["int"], $ID)) or some such. Hmm...needs more thinking... April 26, 2006updated April 27, 2006 Musings about adding AST debugging to debug event listener ANTLRWorks. FunctionalityShow tree being built. Sequence of events coming from running parser. Stepping by location will show nicely. Tree view for smaller trees and "directory" view for big trees. Do we show all subrule roots or just rule's root? How would we display?
As subrules create trees, stack the trees and then hook them together as subrule finishes (animate?). Events can be sent like r0=..., r1=..., r2=... where the n indicates depth like stack and r0 is rule's root ptr. Every rn=... signal could mean start a new subtree below current (pushing up existing to make room or something). A close rn signal is sent to say done with that tree. Do we allow users to see trees returned by rule refs?
Nodes have start/stop token index set that is the range of tokens for that node and below from which this subtree was created. Useful to highlight input sequence associated with start/stop index for a highlighted node. Setting up adaptor. Current tree construction:
We should change to be DebugTreeAdaptor but I won't auto-wrap the adaptor in setTreeAdaptor...they might want to set their own debug adaptor.
Eventscreate nil (even nil nodes have a unique ID...they are not "null" per se) r0 = nil // signal that tree construction for r0 commences create node from token index t; assigns unique node ID (stored in tree) becomeRoot node ID, root ID r0 = node ID (as returned from becomeRoot) addChild root ID, n addChild r0.ID, r1.ID close r1 // do we need this for AW??? close r0 // if so, do we need this or is it implied by rulePostProcessing rulePostProcessing(tree) // converts ^(nil x) to x only so far setTokenBoundaries r0.ID, start, stop // r0 might be new node label = node ID // rule return labels becomeRoot token index, root ID // rewrites addChild root ID, token index // rewrites Ok, added code; might be like this:
IssuesHow to identify nodes so we can say "add node to a prior node"? Even becomeRoot is an issue. Ok, number the nodes as they are created? Will have to wrap the tree and delegate to get the unique node ID...create a DebugTreeAdaptor that delegates to a TreeAdaptor? Yep. Ah, nope...that will mess of tree construction. Use a Map<Tree,Integer> or can we get away with a node's hashCode? No, two identical nodes could be in tree and hashCode would be same if they implement that method. Damn...no way to get address. I wonder if we can check to see if they implement hashCode and if so use the super's hashCode. Nope...node.super.hashCode() doesn't parse. Let's assume hashCode for now. Equals will have to be implemented, but hashCode must remain unimplemented against Java doc. No biggie as it's unlikely that people will add AST nodes to HashMaps. When creating nodes from tokens or doing addChild/becomeRoot, can I refer to tokens by their index since AW has the list? Do we step through tree construction in rewrite rules? Can emit location events I suppose. Useful? Say no at beginning but still need to show tree at end of rule. Final location will show this I think. Throw out tree on error? Do we return partial tree now? Are tree actions turned off during recovery mode? Seems like they should be. What kind of trees do we work with? TreeAdaptor knows how to get the text and token indexes (if any). Create event sends ID, text, token type Test GrammarThis grammar sort of shows what events are triggered (in generated code).
Ok, some real sample output. From, this grammar
and this main:
I get this output for "abcde":
June 22, 2005Rewrite rulesThe new rewrite stuff is a grammar to grammar mapping where you specify how to match something and then how to generate the resulting tree. Note that you can generate a token stream by simply generating a flat list like:
The goal is to avoid arbitrary actions and the key insight to get pure rewrites was that the right side of the -> should be a tree grammar complete with (...)+ loops etc... Monty, Loring, and I discussed the rewrite rules extensively at the Oregon cabal 2 years ago but we couldn't figure out what to do with repeated elements like:
How can you generate just the list of IDs, for example. Easy. Just list the tree grammar fragment that would match what you would like to generate and ANTLR will figure out how to do the transformation. The elements on the left are queued up and then used as input streams to the "unparser" grammar on the right. Here is the rewrite to get the list of IDs:
Nice, eh? What if I want an imaginary token node, VAR, above the list?
What if I want a VAR for each ID? Move the VAR inside the plus loop:
I think that is pretty slick. Note that to get your tree parser grammar, ANTLR should be able to copy these rules for you into a tree grammar similar to what Loring does, although his 2.8e tree grammar creator is more sophisticated than what I have in mind. I have not done the tree parsers yet, but I think it will be really easy. The details may kill me though. ;) I should push an ANTLR-3.0ea3 release shortly. NotesMy brief comments in the README.txt file: 1. Automatic tree construction operators are in: ! ^ ^^ 2. Tree construction rewrite rules are in -> pred1? rewrite1 -> pred2? rewrite2 ... -> rewriteN The rewrite rules may be elements like ID, expr, $label, {node expression} and trees ^( <root> <children> ). You have have (...)?, (...)*, (...)+ subrules as well. You may have rewrites in subrules not just at outer level of rule, but any -> rewrite forces auto AST construction off for that alternative of that rule. To avoid cycles, copy semantics are used: r : INT -> INT INT ; means make two new nodes from the same INT token. Repeated references to a rule element implies a copy for at least one tree: a : atom -> ^(atom atom) ; // NOT CYCLE! (dup atom tree) 3. $ruleLabel.tree refers to tree created by matching the labeled element. ExamplesHere are some example rewrites. I have about 40 functional tests for rewrites and about the same for automatic tree construction.
June 8-9, 2005Ok, I've cut a development branch from ANTLR 3.0ea1 and am starting to work on the tree component. There are four main problems:
Tree structure and constructionAs I've mentioned previously, I'm moving away from using child-sibling trees for the common tree implementation using a simple list-of-children approach. To avoid locking everyone into the same implementation, I'm going to use an approach similar to the TreeModel thingie in Swing's tree viewing classes. Stanchfield got me thinking about this; let's see how it works out... Let's try out an example:
This should generate something like:
Ok, that took me about 6 rewrites to get something that also works for more complicated (double root operation) stuff like:
which should generate something like:
Note that the tree construction occurs after the token has been matched so that errors occur before a bogus node can be added to the tree. Instead, one can add an error node in the tree to mark the problem spot. Interestingly, the notion of a TreeAdaptor, means that I don't actually have to know any of the methods of a Tree. I can literally use Object! The only reason to know the type is so that actions can do $label.tree.myOperation() without having to cast $label.tree. I should still define a basic Tree interface so we can get the text and payload out etc... Hmm...shouldn't the adaptor do this too? Normally, I'd do
but we could also use:
This is a very non-OO thing to do (more like C), but it does provide some breathtaking flexibility, though I'm not sure we need it. Naturally as the programmer, you will know what kind of tree you have and can directly say t.getText() etc... Heh, do I really need the "payload"? That's an implementation detail and as long as an adaptor can answer questions about text, line, column, etc... who cares how they store stuff? The only thing I want to make sure is that you can't create cycles very easily when constructing trees. I assumed everything was going to be a payload object, but now, the adaptor will say how to create a node from a Token or other object. My TreeAdaptor looks like this so far:
Tree construction operators and subrulesHow to translate?
Try this:
That seems easy. How about:
Should build taller and taller tree as more PLUS are seen:
Here you must imagine that anything matched within a subrule is added as a child to the rule's current root. Since I have PLUS^^ in there, the current rule's root is set to that node so the second INT is added as a child of it. If one used postfix notation, it would still work:
Reversing the order shouldn't affect anything:
Ok, something harder:
The ^^ must set the outer rule tree:
After each execution of a subrule or subrule iteration, the tree is added as a child to the current tree. If the subrule's tree is flat, all the nodes created in the subrule are added as children of the current rule's root node. Trees from rule refsHow to handle tree results from rules?
Each rule has an implicit tree return value (which can be accessed via $ruleRef.tree in actions).
Tree rewrite rulesWhile most tree construction can be handled by the ^ and ^^ operators, sometimes you want to specify the structure explicitly in a declarative fashion rather than as the emergent behavior of the tree operations. The general syntax is:
Sometimes you need semantic to dictate which rewrite to use. Semantic predicates work great here:
Simple no-loop alternativesIn alternatives without subrules, rewrite rules are trivial as they are simply reorderings etc...
To generate trees from rewrite rules, just reverse the recognition process:
Ok, how about labeled stuff?
where $a refers to the last value matched just like in actions, though here there is only one value:
The label turns off the usual "add this to a list" functionality. Loops and repeated elementsWhen you refer to an element multiple times or a single element is placed in a loop, multiple values for a single name will be available for translation. And here comes the interesting twist beyond simple "replace this list of tokens with that list of tokens in that order" functionality. The tokens are queued up on the left during parsing and are used as a token stream for "unparsing" / building trees on the right side of the -> rewrite operator. This is similar to StringTemplate building up attributes before actually generating anything. To build repeat substructures in your trees, you need to use a loop in the rewrite section. It consumes tokens from the elements mentioned inside:
results in:
Note that all multi-valued elements must have the same size or a runtime exception occurs.
results in:
I think that only (...)* are allowed in the rewrite section. (...)+ wouldn't add much except a runtime exception that some element was missing. Perhaps later. Note:
is the same as
by the way. What about labels on multiple elements?
It just replaces ID_list with ids I guess. If you want to separate the two ID references, use different labels
Ok, let's try a fairly complicated example:
generates:
Nodes From Imaginary TokensA very common situation when building trees: you want to add a node to the tree for which there is no corresponding input token such as the ARGS node in this rule:
Actually most often you are actually converting a simple input token to a more meaningful node:
After discussing with John Mitchell, I was convinced that you always want to create an imaginary node from a real token; that is, you want the line/column information from it for error messages. You might also need it for getting at the off-channel tokens like whitespace and comments. Hmm..how would you do the first example though with ARGS? Perhaps:
Actually the syntax could probably be ARGS[$first]. Then again, sometimes I want to reset the text and want to do this: ARGS["ARGS"]. I think John suggested to me that we name the args: ARGS[text="ARGS", line=3] or the more complicated: ARGS[from=$lc, text="ARGS"] where lc is a label on the '{'. The "constructor" syntax would get converted to:
StringTemplate rewritesCan I do a simple integration of StringTemplate for use with the rewrite rules? Seems like it as tree rewrites are just tree templates. The input is a list of token objects, which the StringTemplate could work with.
or
or
which creates an instance of template argList and sets its ids attribute to be the list of IDs. The argList template would probably look like:
and would be in a .stg file with other templates used by the grammar. We would need to add a Parser.setTemplateGroup() method so you can dynamically change the target. Wow. This should work out trivially:
How about an inline template example
If I build this in, it will need a predefined property: $ruleLabel.template$ or $.st$? I guess the more explicit is better. Tree parsingFor parsing trees, you can ask the adaptor to get the type and text etc... of a tree. Consider rule
I would generate something like:
where match(PLUS); is a method that essentially executes:
Again, I'm not using the type of the tree node in any way...only the adaptor knows that (or the programmer in grammar actions). You may note that I'll be using a simple input tree node stream that flattens a tree into a stream of nodes plus navigation (DOWN and UP) nodes. This means that the LL(*) analysis can be used w/o modification to look arbitrarily ahead in tree grammars to predict alternatives! Wooohoooo! Add to the adaptor interface:
For error reporting, I'll just call TreeParser.reportError(offendingTreeNode) which will know how to decode the node and get file, line, column information out of it. Tree rewriting in tree parsersWhat about building trees in a tree parser? Should be similar to the rewrite stuff in parsers.
Here, $label should be of UserSpecifiedTreeNodeType type whereas in the parser you need to use $label.tree to get at the tree created for a token. Heh, what if you want the same tree to go back out if the predicate fails? Perhaps strip away all false preds and if none is left, then just return the existing tree. Works just like a rule with no rewrites I guess. Tokens as tree nodes and payload issues.Another problem. Does ^(PLUS $a $b) create new nodes from $a and $b or does it reuse? Should reuse, but how can I avoid cycles? Somebody could make something a child of itself like ^($a $a) or something. That is why I like the payload idea. The problem is that the tree node could literally be the Token object created by the lexer. No reason not to do it that way; it would say thousands of token objects. I think Paul Lucas from BEA (at the time) suggested this at ANTLR2004 workshop. January 30-31, 2005AST Construction Proposal for 3.0The ideas in this proposal are certainly not purely mine. Specifically, many of them come directly from Loring Craymer's excellent work on ANTLR 2.8e. My thoughts on trees have been greatly influenced by other research as well as Loring's including conversations with my other usual research partners such as Monty Zukowski, John Mitchell, Ric Klaren, and many others (antlr-interest list folks). I would like to thank everyone for their hard work and continued aid in my research. Hopefully I'm picking and choosing the right bits from everyone. First let me mention that, while I have thought a lot about this proposal, I'm sure I am missing something. After my stream of consciousness below I tried out this proposal on the java.g grammar and the new ANTLR 3.0 antlr.g grammar. All trees were trivially constructed using the new spec w/o resorting to actions to build trees. I'm looking for a solution that is really simple and effective for 98% of the problems rather than a complicated solution that covers 100%. I prefer something simple like Apple's iPod that does almost everything I would want really really well. I will leave target language actions to cover the 2% of the grammars that the proposed tree solution cannot handle. I would also like to point out that Loring Craymer, the really smart guy who has extended ANTLR to have lots of cool rewrite actions for 2.8experimental, vehemently disagrees with this proposal. I think I'm fair when I say that he hates the rewrite rules because:
Until I see an example that is common that rewrite rules cannot cover, I'm unconvinced we need the extra complexity. For point 2, I'm not worried about the verbosity as it's only for a few rules and the exemplar-based approach is very clear. Point 3 is not a large enough danger in my opinion to warrant using a more complicated approach. Finally, the rewrite rule approach allows me to integrate StringTemplate and token stream rewrites with the same idea, preserving conceptual integrity. :) Anyway, I'm posting this proposal to get feedback from the ANTLR community. I'm very happy to throw this out if we find serious flaws! OperatorsThe current operator-based mechanism in ANTLR (^ and ! operators) that I designed while at a boring supercomputing conference back in '92 has proven extremely useful--it is insanely terse and covers your needs for perhaps 90% of the rules. I'm stunned other tools haven't copied this. I will keep these operators, but will add another operator, ^^, to overcome a weakness. Here are the operators:
To explain the need for the new operator, consider the case where you
have a rule that needs to make a tree composed of at least two
subtrees such as
You want the "var" to be the overall decl rule root, but you also
want the ':' to be a root of the ID and type. E.g.,
for input
With 2.x, you could try:
But you'd get the following wacky tree instead:
The ^ in the loop make ':' the root of the entire tree rather than
of the production in the loop. Sometimes you want ^ to mean subtree
for the rule and sometimes for the subrule, so I made two operators.
^^ makes an element the root of the entire tree almost like the
dot-dot ".." parent directory in Unix shells. Unfortunately, ^
(the old operator) will mean subtree root for current subrule which is
different than before. So, here is the 3.0 spec that builds the right
tree:
or equivalently:
Tree rewrite rulesWhen the operators are insufficient, such as when you want to reverse
the order of some elements, a simple tree template that looks like a
rewrite rule would suffice. For example,
builds a flat tree with nodes out of order compared to the automated
mechanism. Subtree rewrites are also possible of course:
where ^(...) replaces the old #(...) notation. This builds a tree
like this:
What happens when there are repeated elements in the grammar (which
required an important observation)? Let's redo the decl
example above without operators, using a rewrite production instead:
which explicitly builds the tree using a tree grammar; recall that formal grammars can be used to specify both the generation and recognition of a language. Think of the rewrite rules as generational grammars. The important observation is the notion of the input grammar to output grammar syntax/semantics. I just looked again at TXL, a source translation system. It also has a nice "this becomes that" (using Cordy's terminology) syntax. TXL goes parse tree to parse tree, however, and I'm going from tokens to AST or, as you'll see later, back to tokens. I'll do a more thorough comparison later when I do academic papers. This is actually much easier to understand than the operator-based solution because you are providing an examplar or template that says precisly what will be built. Operators force you to imagine the emergent behavior of the rule. How does this work? All recognition elements on the left are
"remembered" by collecting them into lists and then you can refer to
them in the tree template. For example,
collects all the ID tokens in a list and then adds them as children
of the artificial token node ARGS. Please note that the rewrite
rule is exactly the tree grammar you would use to recognize the tree
built by this parser. The tree looks like:
What if you want a list of ARG trees:
Simple. Just pull the artificial token into the (...)+ loop of the
rewrite rule:
What if you want to treat different references to the same element
type differently? Just use a label on the elements. For example,
here is how to build a tree from a text Lisp-like tree description:
where attributes are prefixed with @ even in rewrite rules for consistency. It also makes it clear that you may be referring to an attribute defined in a shared scope like @symbols.names rather than an input grammar element. Referencing expr by itself means all trees matched for expr references throughout the production. To be clear, referencing @children within an action of the input grammar still only means the result of the most recent reference to rule expr. Imaginary NodesIn general, referencing a token type such as ID is actually shorthand for ID[] if ID is not referenced in the grammar to the left of the ->, which would get translated into something like factory.create(ID). You might also want to set the text of the imaginary token with ID[x] or DECL["DECL"] so the tree printing routines will be able to print something useful. What happens when you need to make an imaginary node that also happens
to be a grammar element on the left? The following makes a flat tree
with the parentheses doubled:
So in this unusual case we need to say LPAREN[] to make it clear we are creating an imaginary node. Setting the token type of a nodeIn the Java grammar for 2.x there are about 30 places where I do this:
which is really ugly. Using operators may not be possible in this
case if we want to remove the setType calls. Potentially, we could
do this:
But then what if you want the text for that BLOCK to be "BLOCK"?
which starts to get muddy. In this case, I would suggest:
I think it is much more clear. What if you want to create an imaginary token but use the line info
and such from a real token that will not get put into the tree like
'{' to BLOCK?
Perhaps the factory and do the appropriate translation. Actions in rewrite rulesSometimes you'll want to be able to reference an arbitrary expression
to add a node to a tree:
This makes a tree with the result of evaluating a target language expression as the payload for root and with A and B and children. Rewrite specs in subrulesYou can rewrite the productions of a subrule as well. Be aware that the -> rewrite operator turns off automated tree construction operators for everything in that production and below. Reconsider our decl rule. We can use a subrule rewrite to get the
appropriate tree:
In this case, each iteration of the (...)+ loop creates a new subtree according to the rewrite rule and then that result is automatically added as a child of the "var" node. Rewrite rules at an outer layer override rewrite rules below them because there is no way to refer to a partial tree unless I introduce a way to label a subrule, which is possible. So, currently, you'll get an error if you have nested -> rewrites. Hmm...after rereading this, I think maybe I should disallow rewrite rules for anything but the outermost prodution. The goal is a simple and obvious "here is what the rule generates" and subrule rewrites again force you to imagine what the emergent behavior will be since it behaves like an operator and in conjuction with operators at any higher levels. Ok, so a day later, I see the need for -> rewrites in subrules, but
they set the subtree for the entire rule. Here is the reason:
We need to set the resulting tree differently depending on the
subrules. This may not always work when you have two subrules in a
row like
There may be trouble specifying a different tree depending on syntax. Integration with attributesIf you refer to an attribute, x.y, in a tree rewrite rule, it
becomes the payload object of a tree node. For example, the following
rule creates a single AST node whose payload is the text of the ID
token:
whereas
or just plain
gives you a single AST node whose payload is the ID token itself. References to rule labels gives you the AST subtree(s) created for that
rule reference or references just as a token label gives you the
token(s) matched for that token; e.g.,
gives you a flat list of expressions and is equivalent to:
Rewrites dependent on semanticsUse semantic predicate notation to specify multiple rewrites according
to various semantic conditions:
NotesThere is no way to refer to the "current tree" or whatever since -> specifies what the tree is and all automatic construction is off. There is no way to use rewrite rules to do the operator trees for expressions without referring to the "currently constructed" tree so the tree operators are mandatory for expressions. By construction, there are no side-effects of these rules. They just set the return AST for the rule. ImplementationAST nodesAST nodes will be simple objects with a payload field and a list of children--no more child-sibling lists. Rather than have heterogeneous nodes to hold different bits of data per node, just point at a different payload object. The tree should literally just represent structure, you can put whatever you want in there. This is pretty cool because then there is exactly one definition of an AST node to do anything you want. One of the problems with ANTLR 2.x trees is that you can create cycles where a child pointer actually points back up the tree somewhere to a parent. Trying to walk that tree even to print it out will result in an infinite loop. Once we move the user's payload outside of the tree, multiple nodes can point at the same payload--you don't have to point at the same tree node, thus, likely introducing a cycle. I'm getting more and more into separating/encapsulating concerns. The AST tree node should show structure between payload objects. The user's application defines a single or multiple payload objects that the tree can refer too. So, literally the Common AST node may be:
The most common payload will be a Token object. ASTs point into the token stream and tokens point into the char stream. We are creating more and more sophisticated views/perspectives on the char stream as we lex, parse, build trees, etc... Heterogeneous nodesI don't like heterogeneous tree nodes: you should separate the model from the controller; just good design. If you put your action methods in model by having different node types all overriding action(), for example, it is a "one trick pony". You can only make one pass over that tree and nobody else can use the tree really. Multiple passes are the name of the game which would force you to have your model entangled with multiple methods representing multiple tree passes. Further, your controller is then spread across many many files; bad encapsulation. Passes should be grouped/encapsulated by pass such as "build symbol table" not by tree node. You don't want the "DECL node" pieces of 5 translation phases all in one file. It's the wrong way to slice things. You need to get your controller out of your model--either use a visitor or, more properly, a tree parser which is really just a sophisticated visitor that happens to verify the structure rather than doing a dumb depth-first-walk. I like the idea behind SableCC's approach of using types to guarantee tree structure by construction, but I'm opposed to using the heterogeneous trees to have action() methods for performing translation. Besides, using a tree parser checks your structure as it walks. Also note that a tree grammar can tell you when your tree structure is ambiguous, whereas a bunch of types cannot do that! So, in summary, I am fairly certain to drop heterogeneous tree support for 3.0. Naturally, using actions, you can still build these hetero tree if you want. Rewrite rulesI like rewrite rules because it moves all tree construction for a rule into a single "postfix" action carried out after all information is collected by the parser. In the presence of errors, this may make tree construction much more predictable and manageable (since it's in one spot). Rewrite rules should be encoded in code rather than using a template
library like StringTemplate does for text. For example,
would generate roughly (for Java):
For repeatedly-matched elements, I'll have to build a list:
Naturally, I have no idea how to do all of this yet, but it shouldn't be too hard. It looks like the inverse of building a parser. I like having the two input and output concerns separated instead of entangled. Future thoughtsI am looking at string templates and token streams as rewrites also.
For example, there is no reason why a parser cannot act as a token
stream filter by reading a token stream and spitting out a modified
one. A token stream rewrite is just a flat tree rewrite. E.g.,
spits out the ID tokens back out without the comma. This may be a more sophisticated version of the
TokenStreamRewriteEngine object in ANTLR 2.x because you can insert
text and such easily:
Nice, easy translation based upon syntax...back to text preserving all the whitespace and other hidden tokens in any rule that doesn't do a rewrite. I dig it. Core dumpThis is stream of consciousness when I was thinking about trees; may
give you an idea of how I derived everything.
same as
^(PLUS plus_expr mult_expr) means "make a tree with old tree as
left child". There is an implied:
so repeated invocations will be taller and taller tree? Ok, operators ^ and ! are really useful, but let's try using =>.
where option will be set multiple times by the loop? Hmmm...I might
prefer:
to make it clear it's a template to fill in. What if we wanted an
OPTION node on top?
Boy, that would freak people out. Now here's another rule with a list of stuff and you want OPTIONS on top:
I'm coming to the conclusion that the references in the ^(...)
templates should be in a totally different namespace that the
references in the grammar. Too confusing. Holes should be <...> I
think and others like OPTIONS are literals that create nodes. OTOH,
the following is very convenient:
where you want the actual PLUS token found as the tree node payload
since it has file/line info. Seems that naked PLUS has to be a
literal: an imaginary node in the tree case. To reference the actual
PLUS token, have to set it like anything else? Hm...wait. What if
anything in <...> is a reference to a label or token or rule reference
in the grammar. If it's not the same name then it's a new attribute
you can set.
So for rule:
you'd get an error:"ambiguous reference to rule result 'option' in tree ctor". What about
This means that o is set multiple times to the result of calling
option. Uh oh. Is it the tree or the text or what of a rule
reference? Ah. o=option would be different as it sets o to the
attribute scope of option. Since @o is a tree attribute, you get the
tree attribute o.tree. nah...@o should only be in actions.
yeah, but then it's different notation than StringTemplate!
Heh, let's look at @o again:
why be different than ST though? We'll want => "..." too. Actions are @x.y rather than <x.y> because that is unique. What if we made it @<x.y>. Nah. These are not templates. Actions are @x.y. Templates are <...>. If you want to refer to an attribute
So @options refers to the grammar attribute outside the template. A simple template ref of <options> would not find any value. Damn! Very complicated. Try action approach:
or
where "options" is the result of the subrule and ^(...) outside of action just adds a child to current tree. Oooh. That's like Loring's stuff i think. What about where you want to use ^ cause it's easy, but can't since
you have two root nodes?
Try this:
So list:(ID ':'^ type)+ is a subrule or embedded rule. Has => overall result or ^{...} actions, which loop and execute multiple times. Too subtle with either list= or list:. Booo!
Well, I guess the shorthand is just for when it really just works in
an obvious way. Can mix I suppose:
But, can easily build via:
where decl refers to tree already built? Doesn't work. No tree
built! Have to label everything:
So, ID and type refer to the tree for them but @v is the attribute
(implicitly @v.ast) created for the loop? Boy, I sure don't like
adding that @v in there. Would plain v work? That might be better.
@ is only to highlight in an action.
Here "foo" ref is automatically created tree as it has no => spec. What about changing the token type? Hmm...make a new literal node
note that the charSetElement is considered unique as it picks the closest ref inside the (...)* scope. Can override this with a label. But, how do you get line information? Set manually? Ugh, but what else to do? Actually perhaps a new operator
^ defaults to your scope but ^^ means affect the overall tree!
Heh, that is awesome!!!! So it means more typing for each
expression rule, but many ^ in a loop want the local scope. Perhaps
make ^ the overall tree root and another symbol the "local root".
^^?
Seems like ^ should be local tree and ^^ should be overall, but that probably means more overall typing. How to add imaginary node. Currently:
perhaps
but => turns off auto construction. :( Ah. Wrap in a subrule?
Ugly... How about:
where element+ means to refer to all values of element? For decls,
actually don't want any actions messing with trees! Always in => or
with operators. For
we want:
Works great. So any tree ^(...)+ with a + or * on it is duplicated n
times where repeated grammar elements are matched n times. all
multi-valued elements must be same n size. Multi-valued elements must
be grammar elements matched multiple times or an attribute set
multiple times (e.g., the @decl case above). Repeated elements are
considered single-valued elements to any enclosing ^(...)+ loop.
repeats entire ^(B ...) tree for each value of c and D matched. If
no values matched then the tree is ^(B a).
Here, only the subtree is generated multiple times for each c and D. Exactly like a tree grammar. So, we specify a parser grammar and then the corresponding tree grammar production to match the resulting trees. Beautiful! Skip the middle man and automatically produce the tree grammar w/o having to generate it. Might still use/need Loring's tree grammar generator for operator-based generation. How will we handle references to labels to differentiate which element
refs? For example, how to build a tree from a text tree description?
No need to use @root or @children as meaning is clear. A ref to an element means just the last value. A ref to element+ means ALL values created for it. If you say just element not element+ and there are multiple values, then you get a runtime tree construction error. Should there be both * and + ???
A reference to expr gets all values even if some are labeled and some are not as in the above. In this case, the expr ref in the => section would produce as root the last child. I like this => section as it happens at a very specific time: after all elements in that rule have been matched and it doesn't muck up the recognition part. Hmm...for recursion such as:
what does r refer too? It should be the referenced element, but how do you refer to the current tree for a rule? We need this because there can be many subrules with => specs and they may want to set the root. May not be able to. How to build expression trees with => rewrite specs?
That currently only sets the root of the (...)* block and hence would be a sibling of left. Also, how do you reset the value of left so you can build a taller and taller tree? Ah, what if ^^(...) means set the overall tree and ^(...) sets the
local? (or reverse because make overall root is the most common
thing!)
After one iteration, that would make the proper ^(PLUS left right) tree for the whole rule. Section iteration would lose the previous iteration's PLUS-rooted tree, however. What if we allowed assignment in the spec:
Blows side-effect-free characteristic, but perhaps it's ok. What about @expr.tree as an attribute
The => is scoped in the (...)* so the first mul_expr becomes @expr.tree right away and the => implicitly assigns to it. Wow. I dig it! Still, the side-effects make me really nervous. Goal: be able to build most trees outside of actions so we can interpret. :) How to handle different trees depending on actions?
Nice and neat, but noninterpretable; oh well. In interp mode, just pick the one w/o a pred I guess. Later for text templates, we should be able to do this:
Cool. For lists of stuff can do
elements in the grammar become incoming attributes to the "unparser"
template. When you reference a template in a stg file, it would be:
Hmm...a little weird when the arg and element are same name (e.g.,
"ID=ID") so I added the @ to @type to make it clear what is going on.
A little strange as I don't normally require @ outside of user
actions. Ack! Oh, maybe:
which is totally clear; wait, same problem. I use just <type> above to reference incoming list of type values and <ID> to mean all the IDs. See what the ANTLR v3.0 code generator would look like using this notation. Probably lots of predicates. @ID doesn't make sense as ID is a grammar element not an attribute or
label. :( Will just have to accept:
Perhaps it automatically pulls args from the list of matched elements if they match with the template def says. Means I need to load the template group in ANTLR. Try to reduce the jar dependency though through dynamic loading. OK, it should be @foo for any reference to a label or attribute or
whatever even if inside ^(...) so it's clear always what is a grammar
element and what is a label/attribute. For example, you might want
trees created from values not trees coming out of exprs.
BTW, tree grammars will look like:
November 29, 2004Random thoughts:
November 10, 2004After Loring bitched at me on the phone yesterday <snicker>, I'm beginning to think he's right: if separating the node data from the navigation is the right concept, then trees should be a single implementation that simply carry a "payload" object defined by the user. This is like Sun's MutableTreeNode. The interesting thing is that the tree nodes themselves are all
homogeneous, but the payload object could be totally different at each
node. So, the parser creates exactly one kind of node with a variety
of payload (or a single) types. The payload can be defined in the
grammar itself so that ANTLR can generate the required type(s).
Then ANTLR would generate P_AST type for use when building trees. For
heterogeneous trees, we currently allow type names in the tokens
section:
If we allow field specifications, that is better than specifying a
type name maybe:
Whatever. A random idea. As a bonus to the payload strategy, we can enhance the tree functionality later w/o forcing alterations in people's application; their payload objects still fit in our nodes. This all comes at the cost of an additional object creation (payload creation + node creation). Side note: Mitchell suggested Tokens and Tree nodes should have not only fixed fields like this, but the ability to dynamically acquire "attributes"; this would basically be a hashmap. It cuts down on a bazillion subclasses. It would be useful when parsing xml for example. The TAG token could have a list of tag attributes w/o creating subclasses etc...
November 9, 2004I originally chose child-sibling trees because they are very efficient
in terms of memory, but they have a "different" structure than the
way we normally draw trees. For example tree #(PLUS 3 4) is drawn
like this:
but stored like this:
I had the AST structure:
But there is a huge problem with this: a tree pointer not only points at its subtree, but points also at it's next sibling when it's in a list of children. Further, the previous child points at it. Essentially, the data at a node is mixed up with the structural information. Moving a node is complicated and we often end up with unintended cycles and other mixed up structures. The data structure also does not make it easy to doubly-link it, which is very useful during translation. For 3.0: I am going to alter the basic concept here. Each node will
have a list of children the list will be independent of the nodes such
as an array of pointers. Copying a reference to a tree is easy, just
do the pointer copy--you don't have to worry about previous or next
pointers. Each node will have a link to its parent. Given a random
tree pointer, however, it will be a linear search to find the next
sibling. You have do this:
Now that does not say we cannot have multiple interfaces over this
basic data structure. I'm just saying that the implementation will
make it much easier to do tree manipulations. I passed this by John
Mitchell and he seemed to agree; so I'm sane within the bounds of
John's sanity. ;) Perhaps Tree looks like:
The immediately obvious problem with this interface (and ANTLR 2's
AST interface) is that it collides with DOM trees--they have the
same method names, but DOM as a return type not Tree. I suppose I can
just use different names so a tree implementation could "implements"
both DOM and ANTLR's Tree. In Old English, "sibling" was "sibb".
What about the dreaded "brother" or how about "kindred"? "kin"?
"kinsman"? "Neighbor" (good but icky to spell)? Thesaurus list
nothing else interesting. How about getNext(), getPrev(), and
getChild()? Oh, how about the more direct: getUp, getDown, getLeft,
getRight? Yeah, that works. :) How about this?
Question: should null be allowed as a valid child list item? It would be a convenient means of doing a delete. I think it should not be allowed because then getNextSibling() won't work. If you get null back you couldn't move to the next sibling (how do you search for null if there are multiple?). A further restriction is that you should not add the same tree more than once if you want the getNextSibling() stuff to work. This would be too expensive to enforce probably. November 30, 2003Transformations with array of ast down/right pointers instead of just one. To copy a tree node from pass i to j is just right[j] = right[i] etc... Adding, deleting nodes from phase to phase is ok as links are in separate planes so won't affect previous planes. You don't have to wipe out old pointers to move a node into new phases as old pointers are not seen. Dramatic reduction in number of node allocations as node n is reused from pass i to pass j if it is not altered. Do we have to find a way to force setXXX() methods on trees to dup before write so that previous phases aren't affected? Nah, don't really care about history unless you want to roll back. Token stream is perhaps at phase / plane 0, a flat tree of the input tokens. Might be useful for unparsing to text (in target language) :) |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||