Home | Download | ANTLRWorks | Wiki | About ANTLR | Feedback | Support | Bugs | v2


Latest version is 3.0.1
Download now! »

Download
» Home
» Download
» ANTLRWorks
» News
»Using ANTLR
» Documentation
» FAQ
» Articles
» Grammars
» File Sharing
» Runtime API
» Tech Support
» Bug Tracking
»About ANTLR
» What is ANTLR
» Why use ANTLR
» Showcase
» Testimonials
» Getting Started
» Software License
» ANTLR WebLogs
» ANTLR Workshops
»StringTemplate
»ANTLR v2
»Feedback
»Credits
»Contact


Support StringTemplate, ANTLR Project by making a donation! Terence often pays for things like the antlr.org server, conference travel, and this site design (that alone cost US$1000). Buy him a beer and pizza remotely ;)

Search



ANTLR Code generation

RSS Feed

March 25, 2006

Updated June 3, 2006 as I try to finish implementation; added eot/eof tables.

Currently I generate a bunch of class definitions in order to build the state machines for lookahead prediction. The problem as that you end up with hundreds of .class files in the directory, which is highly annoying. Further, it requires a huge amount of time and to load these dynamically a runtime. The more traditional state machine implementation has a transition function as follows:

newstate = transition[currentstate][currentinputchar];

It is very fast and uses a very dense representation of the DFA graph (compared to the horde of classes I generate now). It avoids method calls too, which are very expensive in a polymorphic language. The reason I did not use those for ANTLR is because of unicode characters. For n states, it requires n*65536 integers for the obvious implementation. Also, some state transitions require semantic predicate expressions that cannot be encoded in this simple strategy.

I have now a very simple compression strategy that should be able to use the above strategy for implementating the vast majority of the states, only moving to a switch statement when predicates are required or the strategy is unable to compress the data far enough to fit in the array.

This strategy requires the following arrays:

  • transition[state][c]. The classical transition function to move from one state to another in the DFA.
  • min[state]. The minimum character value for any edge emanating from this state.
  • max[state]. The minimum character value for any edge emanating from this state.
  • eot[state]. State to state map. If >=0 it indicates that this state has an EOT (end of token) edge; this could be to an accept state and usually is. This edge amounts to an "else"-clause. min[s]/max[s] never includes EOT (-2). Algorithm should jump to eot[s]. Used by the lexer.
  • eof[state]. State to accept state map. If >=0 it indicates that this state has an EOF (end of file) edge that points to an accept state. min[s]/max[s] never includes EOF (-1). Algorithm should jump to eof[s] and then accept. Used by the parser.
  • accept[state]. If this state is an accept state then the result of this computation is the predicted alternative within the decision. 0 implies the state is not have accept state.
  • special[state]. Indicates whether or not this state has complicated edges where complicated means either there is a semantic predicate or the character range was too wide to stay within the normal transition array. That is, when max[state]-min[state] is larger than some threshold ANTLR will use a switch or if-then-else rather than creating a huge array. If this value is >= 0, then this state is special and, importantly, gets mapped to a new state number. This state number is part of a sequence of compressed state numbers so that the target language generator can generate a very efficient "switch on state". All state numbers will be contiguous and be offset from 0.

I have not tried this out, but the DFA simulation algorithm should look like the following:

match(startState) {
  s = startState;
  c = read();
  while (accept[s] < 1) {
    s' = special[s];
    if (s'<0) { // normal state
      if (c>=min[s] && c<=max[s]) {
        s = transition[s][c-min[s]];
        c = read();
        continue;
      }
      else if ( eot[s]>=0 ) {
        s = eot[s]; // tells us where to go
        c = read();
        continue;
      }
      else if ( eof[s]>=0 ) {
        return accept[eof[s]];
      }
      throw noviablealt;
    }
    // complicated states (generated by ANTLR)
    switch (s') {
      case 0 : // compressed state number
        switch (c) {
          ... // edges of state special[s]
          default :
            throw noviablealt;
        }
      case 1 : // compressed state number
      ...
    }
    c = read();
  }
  return accept[s];
}

In the Mantra language parser there are about 700 lines out of 8726 of class definitions for DFA states. In the lexer there were 2051 lines out of 5670. There are about 400 .class files in the directory just for DFA states (from 1 parser and 3 tree parsers). Ick.

October 17, 2005

A followup from yesterday. Ok, I yanked out all the automatic tree creation stuff from Java.stg into JavaAST.stg (well, the templates). Modified the CodeGenerator to look for <language>AST.stg if you use output=AST. I still needed to alter existing templates like alt:

/** An alternative is just a list of elements; at outermost level,
 *  set the AST parameters to indicate what kind of tree to build.
 */
alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
<if(autoAST)>
<if(outerAlt)>
root_0 = (<ASTLabelType>)adaptor.nil();<\n>
<else>
<ASTLabelType> root_<blockLevel> = (<ASTLabelType>)adaptor.nil();<\n>
<endif>
<endif>
<elements:element()>
<if(autoAST)>
<if(!outerAlt)>
adaptor.addChild(root_<enclosingBlockLevel>, root_<blockLevel>);<\n>
<endif>
<endif>
}
>>

So it didn't have any AST stuff in it. First problem would be: get rid of autoAST parameter in Java code. Then need to override alt in JavaAST.stg. I tried

alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
<if(outerAlt)>
root_0 = (<ASTLabelType>)adaptor.nil();<\n>
<else>
<ASTLabelType> root_<blockLevel> = (<ASTLabelType>)adaptor.nil();<\n>
<endif>
<super.alt(...)>
<if(!outerAlt)>
adaptor.addChild(root_<enclosingBlockLevel>, root_<blockLevel>);<\n>
<endif>
}
>>

Which works, but I realized it duplicates the curlies on the outside so then I decided I needed to leave holes in the Java.stg version:

alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
<declarations>
<elements:element()>
<cleanup>
}
/
>>

Hmm...it will get an error when those holes have no values... Define default parameters:

alt(elements,altNum,description,autoAST,outerAlt,
    declarations="", cleanup="")
 ::= <<...>>

Looks good except for the JavaAST.stg case:

/** An alternative is just a list of elements; at outermost level,
 *  set the AST parameters to indicate what kind of tree to build.
 */
alt(elements,altNum,description,autoAST,outerAlt,
declarations={
<if(outerAlt)>
root_0 = (<ASTLabelType>)adaptor.nil();<\n>
<else>
<ASTLabelType> root_<blockLevel> = (<ASTLabelType>)adaptor.nil();<\n>
<endif>
}, cleanup={
<if(!outerAlt)>
adaptor.addChild(root_<enclosingBlockLevel>, root_<blockLevel>);<\n>
<endif>
}
) :: "<super.alt(...)"

Gross! You can't see the template defs as default arguments! Ok, so I decided we needed to try inheritance for this:

alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
<alt_declarations()>
<elements:element()>
<alt_cleanup()>
}
>>

alt_declarations() ::= ""

alt_cleanup() ::= ""

That is pretty annoying with the "alt_" prefix to make it unique and that you have to define them even though they are empty! Then in the subgroup, you do this:

alt_declarations() ::= <<
<if(outerAlt)>
root_0 = (<ASTLabelType>)adaptor.nil();<\n>
<else>
<ASTLabelType> root_<blockLevel> = (<ASTLabelType>)adaptor.nil();<\n>
<endif>
>>

alt_cleanup() ::= <<
<if(!outerAlt)>
adaptor.addChild(root_<enclosingBlockLevel>, root_<blockLevel>);<\n>
<endif>
>>

Note the alt_ is like the scope override for a method definition in C++.

Ok, the problems with this straight inheritance scheme:

  1. define a bunch of empty methods
  2. have to follow a prefix scheme to get unique "holes" for subgroup templates to fill in
  3. there could be a disconnect between group and subgroup overrides w/o a more formal approach

Maybe we need the "aspect" approach after all. Currently with inheritance, we define a group with new templates and template overrides. What if we could have finer granularity than template? We could define regions of a template that could be overridden but the syntax could be nasty:

alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
<region(declarations)>
<endregion>
<elements:element()>
<region(cleanup)>
<endregion>
}
>>

Most of the time the region would be empty. Seems like a special kind of hole is what we want:

alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
<@declarations>
<elements:element()>
<@cleanup>
}
>>

These are essentially attributes that don't generate an error if they have no value and we could have special syntax to set them in a subgroup. Now a subgroup could define new templates, define template overrides, and define supergroup template aspects:

@alt.declarations ::=
<if(outerAlt)>
root_0 = (<ASTLabelType>)adaptor.nil();<\n>
<else>
<ASTLabelType> root_<blockLevel> = (<ASTLabelType>)adaptor.nil();<\n>
<endif>
>>

@alt.cleanup ::= <<
<if(!outerAlt)>
adaptor.addChild(root_<enclosingBlockLevel>, root_<blockLevel>);<\n>
<endif>
>>

It's like overriding template regions or setting the aspects of a template. I suppose you could allow setting of aspects even within a group:

alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
<@declarations>
<elements:element()>
<@cleanup>
}
>>

@alt.declarations ::= "int count=0;"

This would be identical to

alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
int count=0;
<@declarations>
<elements:element()>
<@cleanup>
}
>>

except that a subgroup could not remove the "int count=0;". When @alt.declarations is a separate aspect, the subgroup could remove the decl:

@alt.declarations ::= ""

or add to it:

@alt.declarations ::= <<
<...>
AST root = null;
>>

where <...> would mean "previous aspect value". Or,

@alt.declarations ::= [...,"AST root = null;"]

using the "list creation" syntax. Seems cleaner.

Just to be clear, any aspect definitions in a group are not overridden in a subgroup. Just like constructors invoke their super ctors before they do any work, the supergroup templates must be complete before they are overridden; complete in the sense that all aspects within that group have been applied. Then the overridden method can ref super.template() or whatever normally. The subgroup aspects are applied after the subgroup template has been created. Confusing language: will have to clean this up. Oh, and seems like code cannot set aspects as they are output snippets not data.

Regions

Ah! I just talked to Ric Klaren again...let's go with regions as it's consistent. You are just defining chunks of templates that you can override. All we need is some syntax. How about:

alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
@declarations::={
int count=0;
}
<elements:element()>
@cleanup::={}
}
>>

Or, if we want angle brackets around it:

alt(elements,altNum,description,autoAST,outerAlt) ::= <<
// <fileName>:<description>
{
<@declarations>
int count=0;
<@end>
<elements:element()>
<@cleanup><@end>
}
>>

Can't have isolated <@cleanup> as it wouldn't know whether to seek an <@end>. Perhaps another tag for shorthand <@cleanup@>.

One can imagine an implementation of this as simply syntactic sugar that converted regions into actual templates so StringTemplate could remain the same. The parser group syntax parser would simply treat this @ stuff as a bunch of macros.

With regions we don't need special syntax to add to an existing region etc... Can just refer to <@super.declarations> or something similar to get the supergroup's region definition. Have to be careful about the syntax though so it's not ambiguous between define a region and refer to a region.

October 16, 2005

ANTLR2005 workshop is this week and Ric Klaren is in town early for code generation hacking! The code generation templates have become a complete mess due to the introduction of code gen aspects such as tree construction, debugging, etc... I brought this up and we discussed solutions.

First the problem. Each template by itself is fine like:

/** match a token optionally with a label in front */
tokenRef(token,label,elementIndex) ::= <<
<if(label)>
<label>=input.LT(1);<\n>
<endif>
match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);
>>

Then, I need to add functionality for a token ref with a += label. Then automatic tree construction with += label. Then tree rewriting with += label. Etc... The template got totally unreadable with all the IF conditionals! Ick. So, as much as I hated to do it, I started creating very specialized templates. I currently have the following huge list of templates to handle all possible combinations of label types and tree construction:

  • tokenRef
  • tokenRefAndListLabel
  • tokenRefASTBang (same as tokenRef)
  • tokenRefASTBangAndListLabel (same as tokenRefAndListLabel)
  • tokenRefASTTrack
  • tokenRefASTTrackAndListLabel
  • tokenRefASTAndListLabel
  • tokenRefASTRootAndListLabel
  • tokenRefASTRuleRootAndListLabel
  • tokenRefAST
  • tokenRefASTRoot
  • tokenRefASTRuleRoot

Note that I think we should still factor out the various operations like tokenRefASTRootAndListLabel as it represents a different grammar element rather than an option change like "build ast". But the template should say how it's different from a normal tokenRefASTRoot template for example (with whatever mechanism we choose) to avoid duplication in the templates (maintenance nightmare). It's only 93 lines of template code, but it's the repeated template snippets within templates that is really bad...I suppose the proliferation of template names is a necessary evil, though unpleasant.

We discussed using template inheritance but we thought it would be almost as cumbersome:

/** match a token optionally with a label in front */
tokenRef(token,label,elementIndex) ::= <<
<tokenRefBEGIN()>
<if(label)>
<label>=input.LT(1);<\n>
<endif>
<tokenRefMIDDLE()>
match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);
<tokenRefEND()>
>>

tokenRefBEGIN() ::= "" // do nothing extra by default
tokenRefMIDDLE() ::= ""
tokenRefEND() ::= ""

A subgroup would then override tokenRefBEGIN template etc...

What we need is some simple form of composition so we thought about using an "aspect" like mechanism. The "AST aspect" would fill in holes in the tokenRef template by just setting attributes:

tokenRef(token,label,elementIndex) ::= <<
<begin>
<if(label)>
<label>=input.LT(1);<\n>
<endif>
<middle>
match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);
<end>
>>

That would be the normal template and in a separate aspect file you'd say

aspect AST;

// re-use map syntax to set the attribute holes for this aspect
tokenRef ::= [ begin:"blort", end:"blech"] // omit middle for fun

You'd register an aspect on the normal group something like this:

StringTemplateGroup java = ...;
StringTemplateAspect AST = ...;
java.addAspect(AST);
...
StringTemplate st = java.getInstance("toptemplate");
String output = st.toString();

Executing toString() would first get the template and then apply all aspects, setting begin and end attributes before actually rendering. It would behave as if you had simply done these extra lines:

st.setAttribute("begin", "blort");
st.setAttribute("end", "blech");

before the toString().

That was a really interesting approach until I realized that we need to do more than modify existing templates--we need to add new templates. You don't have the normal Java output templates to know anything about the need for template tokenRefAST() for example. That template should be defined in a different group.

Again, we're back to inheritance. You need to have group JavaAST inherit from Java group to add templates...we could define aspects to allow new templates, but then it's duplicated functionality. Ok, let's try again with inheritance, but using the default parameter functionality to refine and set parametes. I use this default parameter thing already for making parser and treeParser refine or override or whatever you want to call it like this:

/** How to generate a tree parser; same as parser except the input
 *  stream is a different type.
 */
treeParser(name, scopes, tokens, tokenNames, globalAction, rules,
cyclicDFAs, bitsets, labelType={<ASTLabelType>}, ASTLabelType="Object",
superClass="TreeParser") ::= <<
<genericParser(inputStreamType="TreeNodeStream", ...)>
>>

parser(name, scopes, tokens, tokenNames, globalAction, rules,
cyclicDFAs, bitsets, ASTLabelType, superClass="Parser", labelType="Token") ::= <<
<genericParser(inputStreamType="TokenStream", ...)>
>>

Ok, now we need to define what a token reference with += label looks like. Here is what it looks like now:

/** ids+=ID no AST building */
tokenRefAndListLabel(token,label,elementIndex) ::= <<
<label>=input.LT(1);
match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);
if (list_<label>==null) list_<label>=new ArrayList();
list_<label>.add(<label>);<\n>
>>

which share identical snippet:

<label>=input.LT(1);
match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);

with tokenRef. Seems like we want to define tokenRefAndListLabel as it differs from tokenRef. This has nothing to do with trees so should be in the same template group. Together the two would look like:

/** ID no AST; may or may not have a label */
tokenRef(token,label,elementIndex) ::= <<
<if(label)>
<label>=input.LT(1);<\n>
<endif>
<middle>
match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);
>>

/** ids+=ID no AST building */
tokenRefAndListLabel(token,label,elementIndex) ::= <<
<tokenRef(...)>
if (list_<label>==null) list_<label>=new ArrayList();
list_<label>.add(<label>);<\n>
>>

where I have reused template tokenRef by simply referring to it and using (...) as the arg list to say "allow tokenRef to see all of tokenRefAndListLabel's parameters. I also don't need begin and end holes because I can always surround the other content with literals as I've done here.

What about altering the parameters or defining what goes in a hole? The following template adds an 's' to the label tokenRef will use to define the label and it sets the middle hole to be extra:

tokenRefAndListLabel(token,label,elementIndex) ::= <<
<tokenRef(label={<label>s}, middle="extra", ...)>
if (list_<label>==null) list_<label>=new ArrayList();
list_<label>.add(<label>);<\n>
>>

So, within a group, I can define one template as it differs from another pretty easily.

That is within a group. What about a new group like JavaAST where we want to encapsulate all AST stuff? You want to perhaps:

  1. augment a template
  2. replace a template
  3. fill a few holes in a super template
  4. add a new template

To add a template in a subgroup and define it as it differs from another is easy and looks the same as tokenRef vs tokenRefAndListLabel:

/** match a token when output==AST */
tokenRefAST(token,label,elementIndex) ::= <<
<tokenRef(...)>
<label>_tree = adaptor.create(<label>);
adaptor.addChild(root_<blockLevel>, <label>_tree);
>>

We are saying that to add this token to an AST, add that stuff after a normal tokenRef. Actually, it seems we could also override tokenRef to have the AST stuff when we use group JavaAST:

/** match a token when output==AST */
tokenRef(token,label,elementIndex) ::= <<
<super.tokenRef(...)>
<label>_tree = adaptor.create(<label>);
adaptor.addChild(root_<blockLevel>, <label>_tree);
>>

That makes sense. Remove the "AST" component of the name as we're now in the JavaAST group. This will simply the code generator code also because it doesn't have to add or omit that word when computing a template name.

Ok, I just refactored all the templates. Dropped from 93 to 18 (normal) + 59 (AST) == 77 lines, but there is essentially no duplicated template content! 20% drop represents the duplicated content.

The normal template file looks very clean now for tokenRef:

/** match a token optionally with a label in front */
tokenRef(token,label,elementIndex) ::= <<
<if(label)>
<label>=input.LT(1);<\n>
<endif>
match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);
>>

/** ids+=ID no AST building */
tokenRefAndListLabel(token,label,elementIndex) ::= <<
<tokenRef(...)>
<listLabel(...)>
>>

listLabel(label) ::= <<
if (list_<label>==null) list_<label>=new ArrayList();
list_<label>.add(<label>);<\n>
>>

where listLabel is reused in the JavaAST subgroup:

/** ID! and output=AST (same as plain tokenRef) */
tokenRefBang ::= tokenRef

/** ID and output=AST */
tokenRef(token,label,elementIndex) ::= <<
<super.tokenRef(...)>
<label>_tree = adaptor.create(<label>);
adaptor.addChild(root_<blockLevel>, <label>_tree);
>>

/** ID^ and output=AST */
tokenRefRoot(token,label,elementIndex) ::= <<
<super.tokenRef(...)>
<label>_tree = adaptor.create(<label>);
root_<blockLevel> = (<ASTLabelType>)adaptor.becomeRoot(<label>_tree, root_<blockLevel>);
>>

/** ID^^ and output=AST */
tokenRefRuleRoot(token,label,elementIndex) ::= <<
<super.tokenRef(...)>
<label>_tree = adaptor.create(<label>);
root_0 = (<ASTLabelType>)adaptor.becomeRoot(<label>_tree, root_0);
>>

/** ids+=ID! is same as tokenRefAndListLabel */
tokenRefBangAndListLabel ::= tokenRefAndListLabel

/** label+=TOKEN when output=AST but not rewrite alt */
tokenRefAndListLabel(token,label,elementIndex) ::= <<
<tokenRef(...)>
<listLabel(...)>
>>

/** Match label+=TOKEN^ when output=AST but not rewrite alt */
tokenRefRootAndListLabel(token,label,elementIndex) ::= <<
<tokenRefRoot(...)>
<listLabel(...)>
>>

/** Match label+=TOKEN^^ when output=AST but not rewrite alt */
tokenRefRuleRootAndListLabel(token,label,elementIndex) ::= <<
<tokenRefRuleRoot(...)>
<listLabel(...)>
>>

/** ID but track it for use in a rewrite rule */
tokenRefTrack(token,label,elementIndex) ::= <<
<super.tokenRef(...)>
list_<token>.add(<label>);<\n>
>>

/** ids+=ID and track it for use in a rewrite rule; adds to ids *and*
 *  to the tracking list list_ID for use in the rewrite.
 */
tokenRefTrackAndListLabel(token,label,elementIndex) ::= <<
<tokenRefTrack(...)>
<listLabel(...)>
>>

May 8, 2005

On Friday, I had a beautiful "StringTemplate code generation moment" when working on ANTLR v3.

Background on multiple phase translations

Computer language translation is a matter of making 1 or more passes over the input to collect information and then generating the necessary output with a final pass. Only the simplest, syntax directed, translations get away with a single pass over the input integrating information collection and generation. For example, the simple notion of forward references causes lots of trouble, requiring at least two passes over the input to discover references to undefined language elements like functions. Older languages such as C and Pascal require the human programmer to declare functions before they are seen lexically in a program in order to simplify compiler construction.

Using a multiple pass approach is an excellent programming technique to simplify a difficult translation into manageable phases, but it often demands more complicated data structures to hold phase results.

Consider translating from a language that allows implicitly-defined variables like Fortran and Python to a language that requires definitions at the start of a block like C. Among all of the other work, it is nice to have a pass that explicitly defines any implicitly-defined variables. For example, in Python the following code

def foo(n):
    if n>1:
        a = 3;

would be translated to the following C code:

void foo(int n) {
    int a;
    if (n>1) {
        a = 3;
    }
}

The define implicitly-defined scalars phase would collect a list of all scalars without declarations by visiting all assignment statements. Later, in the final code generation phase, that list of scalars must be available and the generation phase must be changed every time this phase alters the scalar lists or other data structures in any way. It would be nice to isolate each phase from the others as much as possible.

Using StringTemplate in multi-phase translators

StringTemplate is to output languages as ANTLR is to input languages. In other words, a group of StringTemplate templates represents an output (generational) grammar. The StringTemplate engine feeds off of attributes while an ANTLR parser feeds off of a token stream. The two are duals of each other, which is how I showed (roughly) that even a template engine restricted to enforce model-view separation can generate the context-free languages. Translators should use a parser generator for recognizing input and a template group as a code generator.

Aside from the nice symmetry, using templates for code generation helps to isolate translation phases. In fact, there may be no final translation phase that combines the results from the internal phases, hence, there are usually fewer inter-phase dependencies. The outermost template's toString() method renders the template to text.

A StringTemplate group describes the overall structure of the output language and provides an infrastructure or scaffolding. A phase merely fills in some of the holes. For example, here is a template definition for C functions:

function(type,name,arg,decls,stats) ::= <<
<type> <name>(<arg:argument(); separator=",">) {
    <decls>
    <stats>
}
>>
argument() :: "<it.type> <it.name>"

The arg:argument(); separator="," notation indicates that template argument is to be applied to each arg attribute (which must be an aggregate with type and name properties) with a comma in between multiple values.

Aside from being extremely clean and informative, a StringTemplate template provides a form of lazy evaluation, an incredibly important characteristic that allows a template to refer to values that have not yet been defined. This permits a translator to compute values when it finds it efficient or convenient. An output sentence is represented via a highly-nested tree of templates and expressions (referring to attributes) analogous to a parse tree associated with input parsing.

The function template lazily evaluates expressions <decls> and <stats> in the sense that many phases can add declarations or statements as needed and in any phase order. The StringTemplate.toString() method forces evaluation of expressions, but until that time templates may refer to any value(s) they want. Phases may define the attributes in any order. The implicitly-defined scalar phase can walk the assignment statements directly adding to the decls attribute without affecting any other phases that need to declare variables.

Altering the past

Last Friday I was trying to implement an ANTLR feature where actions may refer to tokens or rules in a production if the reference is unique. For example, I wanted the following to work:

atom : ID {System.out.println($ID.text);}
     ;

The translation should behave as if the programmer had typed:

atom : x=ID {System.out.println($x.text);}
     ;

A non-StringTemplate-based translator might have trouble with this situation. The problem is that when the translator was translating the action, and discovered the $ID, it would already have generated code for the ID token reference. The output may already have been written to a file. By evaluating output early, the translator cannot affect previously-seen language constructs such as the ID.

The template for a token reference is simply:

tokenRef(label,token) ::= <<
<if(label)><label> = input.LT(1);<endif>
match(<token>);
>>

An isolated ID yields match(ID); and x=ID yields:

x = input.LT(1);
match(ID);

Using a template engine with lazy evaluation, a phase that translates actions may affect output for prior elements such as the ID reference by simply, in this case, defining the label attribute. The ANTLR ActionTranslator asks for the code chunk (a template instance of "type" tokenRef) associated with the ID reference and just sets the label attribute to a new, unique label name.

Information can flow the other direction as well. The ActionTranslator may ask if the ID reference already has a label associated with it. This simultaneously prevents multiple $ID references in the action from trying to create multiple labels on the ID and also allows the reuse of an existing user-defined label. The following ANTLR grammar

atom : id=ID {System.out.println($ID.text); String s=$ID.text;}
     ;

would be interpreted as

atom : id=ID {System.out.println($id.text); String s=$id.text;}
     ;

In other words, the generated code would refer to the id label not an unfamiliar automatically generated label conjured up by ANTLR.

Now that is cool.

October 27, 2004

I just reread the RE2C scanner generator by Bumbulis and Cowan. Generates hardcoded DFA like I intend to do. Tom Pennello did this for LR parsing back in the 80's. A very fast method in general. Must faster than simulating DFA with tables.

My notes from the paper:

Optimizations

They have an analysis showing IFs are better than SWITCHes sometimes depending on the density of the edge labels. When they generate IFs, they order the edge label values to make simpler expressions. Can also do binary search on the labels.

They use bitsets but they have 8 bit char vs 16 bit UNICODE. When the labels are latin char etc... though I can consider doing bitsets for complicated edge transitions.

Also, for chains of states (such as generated for "abc"), you can avoid a goto by generating stuff like this:

s0: if !a goto error;
s1: if !b goto error;
s2: if !c goto error;
    accept

Just let the CPU program counter fall through to the next state.

Buffering IO

While I think just about all applications will be able to buffer the entire input, some very large input will require buffering. They mention a "sentinel" technique where some illegal char like ASCII 0 signals when to reload. So if you buffer 1024 char, then add 0 to the end and then have a state transition on 0 for every state that goes to a special "buffer me again" state.

An easier mechanism is apparently to simply buffer lines since \n is unlikely to exist in the middle of a token (I've never seen it do that, so probably ok).

October 19, 2004

For the Java target of ANTLR 3.0, I've been concerned about the hideous nature of the code generated for DFA machines. Without a goto instruction in Java, you have to use a massive switch (slow as death) or use method calls (slow); plus you have to create like an object for each of thousands of states when the Lexer or Parser is loaded. Ick.

Sriram Srinivasan at the ANTLR2004 workshop suggested generating byte code directly. There is a goto bytecode of course. :) The beauty of these suckers is that the CPU program counter will be used (once compiled to native code) to traverse the states of the DFA. That will be insanely fast; about the same speed as the C DFAs Ric generates for 3.0.

So, I will generate byte code assembly for the DFAs from the ANTLR code generator using StringTemplate so people can tweak the byte codes easily. For example, you might want to inline some of the char fetch stuff. That will haul ass!

The only problem was that I didn't want to use BCEL or Jasmin or any other package since it means ANTLR depends on those packages (i hate that). So, I spent the weekend and yesterday building a .class file writer and then a byte code assembler. Both are tiny as they are very specific to my needs (only handles like 50 bytecodes). Nonetheless, 900 total lines of code is nice. :) An excellent price to pay to avoid other people's libraries.

Here is a sample (partial) state in my assembly code

    aload 0
    iconst 1
    invokeinterface IntegerStream.LA 2
    istore 1
    iload 1
    iconst 98
    if_icmplt x
    iload 1
    ireturn
x:
    iconst -1
    ireturn

I'm guessing that a minimal state will be about 20-ish bytes, allowing for a max of about 3000 states in any specific DFA. Not a serious limitation probably.

I also hope to use this code in my CS652 grad prog lang class whenever I teach that again. A nice real world problem.

June 14, 2004

Loring Craymer convinced me yesterday (supported by Ric Klaren's comments on the list today) that making some static objects is a better solution than a text-based DFA spec. Agreed. So, playing around today, I get the following possible solution:

1. Base DFA class with a trivial State definition that says how to transition:

    static class DFA {
        static abstract class State {
            public abstract State transition(int c);
        }
        int alt;
        public int predict(ANTLRCharStream input, State start) {
            int mark = input.mark();
            State s = start;
            alt = 0;
            while ( input.LA(1)!=ANTLRCharStream.EOF && alt<=0 ) {
                s = s.transition(input.LA(1));
                input.consume();
            }
            input.rewind(mark);
            return alt;
        }
    }

Then a specific DFA could be a singleton object derived from DFA. For example, here is a DFA predicting rule nextToken given rules:

nextToken : WS | ID ;
WS : ' ' | '\n' ;
ID : ('a'..'z')+ ;

Here is the specific DFA:

    // ' '|'\n' predicts alt 1 of nextToken
    // 'a'..'z' predicts alt 2 of nextToken
    static class DFA1 extends DFA {
        public int predict(ANTLRCharStream input) {
	    return predict(input, s0);
	}
        State s0 = new State() {
            public State transition(int c) {
                switch (c) {
                    case ' ' :
                    case '\n' :
                        return s1;
                    default :
			// note that I can have all sorts of wacky stuff
			// including predicates and unicode in this
			// transition method
                        if ( c>='a' && c<='z' ) {
                            return s2;
                        }
                        break;
                }
                return null;
            }
        };

        State s1 = new State() {
            public State transition(int c) {
                alt = 1;
                return null;
            }
        };

        State s2 = new State() {
            public State transition(int c) {
                alt = 2;
                return null;
            }
        };
    };

Note that this is not optimized in any way. For example, s1 and s2 are totally unnecessary. Of course, this LL(1) decision wouldn't even result in a DFA, but it illustrates the point. ;)

Anyway, finally, you create the singleton:

static final DFA1 dfa1 = new DFA1();

and then use it like this:

    public Token nextToken() {
        currentToken=null;
        text.setLength(0);
        switch (dfa1.predict(input)) {
            case 1 : rWS(); break;
            case 2 : rID(); break;
	    default : // error
		break;
	}
    }

This further does not take into consideration the optimizations I plan on doing for the lexer to avoid re-examining chars.

Size? Hmm...doesn't it seem that the simple DFA encoded as static objects is still pretty damn big even though it's mostly random java syntax crap and the actual generated bytecodes is probably small. Generating bytecodes would be the tightest, but not really readable. Can't use goto bytecodes as they won't decompile into Java. :(

June 13, 2004

Concerning the new parsers and lexers generated by ANTLR 3.0, I'm trying to decide how to encode the prediction DFAs. Most decisions are LL(1) in which case I can generate a normal looking thing like:

switch (input.LA(1)) {
	case IF : ifStat(); break;
	case BREAK : match(BREAK); break;
	...
}

However, when the lookahead gets k>1 (or possibly even arbitrarily large), I need to generate a state machine that figures out which alternative will eventually match. Consider that k fixed lookahead is really just a state machine of fixed depth k (i.e., exactly k states examined for each alternative and no loops). For arbitrary lookahead, clearly the DFA needs to allow loops etc... These can get sort of complicated looking in implementation, but will provide some dramatic power. For example, the following will freak out any fixed lookahead LL-based parser generator:

func
	:	type ID '(' args ')' ';'
	|	type ID '(' args ')' '{' body '}'
	;

Naturally, this isolated case could be left-factored but it would be less natural etc... ANTLR 3.0's "LL-Star" analyzer will gladly accept this at the cost of run-time speed. A DFA will zip ahead looking for the ';' or '{' that uniquely predicts which alt will eventually match. This is like automatic backtracking, but with a fast DFA instead of the full-parse.

My goal is to provide a tool that allows newbies to just type something in and not have it puke all over the place with ambiguity warnings. More experienced users can build more tightly tuned grammars. The more LL(k) it is, the faster it will go. :)

So, to my big question. Java doesn't have gotos which makes it a pain in the butt to generate decently looking state machines. You have to use nested switches like this:

while (true) {
    switch (state) {
        case 0:
            switch (c) {
                case 'a' :
                    state = 1;
                    consume();
                    break;
                case 'b' :
                    state = 2;
                    consume();
                       break;
                case 'c' :
                    state = 3;
                    consume();
                    break;
                case EOF :
                    break matching;
                    default :
                   error(c);
               }
               break;
       case 1:
       ...
    }
}

Eewww. Even in C, though, the DFA encoding could be rather large. We also have the problem in Java that methods can only be 64k and some DFAs would break that.

Instead, for non-LL(1) decisions, I wonder if we should just encode the DFA in a terse (serialized) syntax and then build the DFA in memory (instead of using freeze-dried code) at parser initialization time. The cost is linear in the size of the DFA so there is no serious cost. I am leaning more and more to having decent symbolic information at parse time such as the original grammar so that great errors and recovery could be done (development would be easier too). Building DFAs dynamically fits in this direction.

It turns out that I already have a readable DFA description syntax that I use for unit testing the analyzer. For example, for grammar g, the DFA would be the string in "expecting":

    Grammar g = new Grammar(
            "grammar t;\n"+
            "a : A B | A C;");

    String expecting =
            ".s0-A->.s1\n" +
            ".s1-B->:s2=>1\n" +
            ".s1-C->:s3=>2\n";

Graphically, the DFA would look like:

s0-A->s1-B->s2
      |
      ---C->s3

where s2 and s3 are stop states. So, the general form of a parsing decision would look like:

rule() {
  DFA decision = dfaFactory.get("...DFA description...");
  switch ( decision.predict() ) {
  case 1 : // alt 1
  case 2 : // alt 2
  default : // error
  }
}

The start up cost would be pretty small and the resulting DFA would be cached by the dfa factory. My main question revolves around readability. One of the big draws of ANTLR is its readability and I don't want to screw that up. One could argue that just the notion of a DFA is less readable, but from my own experience the key is being able to watch the parser route control flow in the debugger--not necessarily verifying the prediction expressions. In fact, the expressions are often references to sets that are totally opaque to the programmer.

In summary, the common LL(1) decisions can be handled with simple switch statements, but the more complicated DFAs should be handled not with hard-coded DFAs but with DFAs dynamically created from simple Human-readable descriptions. This has the added benefit that interpreted parsers and code generated parsers would be using the exact same mechanisms.

January 04, 2004

Thinking more about the parser intermediate language. Rather than generate a tree directly from the NFA+DFA stage, there are advantages to generating text and then converting to a ParseLang tree:

  1. It's easier to generate text than other trees usually (particularly from a tree parser...gets confusing).
  2. A text form would be human readable and serve as a verbose dump from ANTLR to explain the high level parser it will be constructing.
  3. Other (even non-Java) tools can generate this intermediate language and then take advantage of ANTLR's back end code generator to produce parsers.
  4. The text format would provide an excellent test point.

You could view this format as a program-like grammar with explicitly-specified lookahead computations. For example, the simple grammar a : A | B ; might look like this:

rule a {
  select {
    lookahead A => {match A;}
    lookahead B => {match B;}
  }
}

The lookahead could get rather complicated. Decisions would assume k=1 unless otherwise specified, but for k>1 you'd need options for approximate lookahead, full LL(k) lookahead, syntactic predicate lookahead, semantic predicates and their syntactic contexts, and finally LL-regular lookahead state machines. Here is a simple k=2 decision with an approximate and full LL lookahead decision:

a : A B
  | b
  | A C
  ;
b : X B
  | A Y
  ;

One could express this as:

rule a {
  select {
    lookahead(k=2) {A}, {B} => {match A; match B;}
    lookahead(k=2) {X,A}, {B,Y} & !{(A B)} => {produce b;}
    lookahead(k=2) {A}, {C} => {match A; match C;}
  }
}

rule b {
  select {
    lookahead X => {match X; match B;}
    lookahead A => {match A; match Y;}
  }
}

For syntactic predicates, the lookahead would look like another block to match:

// k=1; use pred to see k>1
a : (A B|C)=>A B
  | A C
  ;

You might gen something like this:

rule a {
  select {
    lookahead {A,B} & {
      select {
        lookahead A => {match A; match B;}
        lookahead C => match C;
      }
    } => {
      match A; match B;
    }
    lookahead {A} => {match A; match C;}
  }
}

Semantic predicates such as:

decl : typename ID | function ;
typename
     : {isType(LT(1).getText())}? ID
     | {isSpecial(LT(1).getText())}? ID
     | "int"
     ;
function : ID '(' ')' ... ;

have both a lookahead and action component:

rule decl {
  lookahead ID & {isType(LT(1).getText())}? |
            ID & {isSpecial(LT(1).getText())}? |
            "int" => {
    produce typename;
    match ID;
  }
  lookahead ID => produce function;
}
rule typename {
  select {
    lookahead ID & {isType(LT(1).getText())}? => match ID;
    lookahead ID & {isSpecial(LT(1).getText())}? => match ID;
    lookahead "int" => match "int";
  }
}
...

Hmm...The syntax is inconsistent or at least the {...} is highly overloaded to mean set, code block, recognition block, etc...

Anyway, subrules are different select blocks. For example,

(A|B)? C

would yield:

select {
  lookahead A => match A;
  lookahead B => match B;
  lookahead C => break;
}

A star loop would be similar:

(A)* B

yields

loop(n>=0) {
  lookahead A => match A;
  lookahead B => break;
}

Note: code generators working off this format could optimize this as simply "while lookahead is A" or "while lookahead is not B".

No plus loop:

(A)+ B

yields

loop(n>=1) {
  lookahead A => match A;
  lookahead B => break;
}

This syntax makes repeated structures like

(A)[4] B // match 4 A's then a B

pretty easy and consistent:

loop(n=4) {
  lookahead A => match A;
  lookahead B => break;
}

Now, for LL-regular grammar, we'll have to be able to specify a state machine somehow. It seems like a waste to convert the internal DFA to a text regex and then back to a regex. Hmm..perhaps the DFA testing format could be used (I manually specify what the DFA should look like and then compare it against what is generated by the tool). I guess just a dump of the states and transitions will be what we need like:

0-A->1->B->2

might result in:

n=3;
vocab:A..B
accept:2
0->1:A
1->2:B

On the other hand, this all seems like a lot of work to code even if really useful. Hmm...

November 18, 2003

[Medical doctors are always late by about 1 hour minimum, hence, I brought a notepad to his office...results herein]

A primary goal is easy code gen for a variety of languages. This necessarily requires a multi-phase translation approach so people can jump in wherever they want in the code gen process. Here is what I'm thinking

grammar -> NFA -> NFA+DFA -> ParseLang -> NIL -> C, C++, Java, Perl, ...
                     |           |
                     |           |-------------> Perl, ...
                     |
                     |-------------------------> LISP, other weird stuff

where ParseLang is some antlr parser intermediate form and NIL is some "neutral imperative language". NIL (just a random name now) would also be presumably the ANTLR grammar action language as well. Given a NIL translator to language L, you could translate actions as well as generate parses in L.

If your language is sort of an imperative language, but perhaps with higher level recognition capabilities, you would probably generate code from the ParseLang stage as imperative method calls etc.. would be too low level.

If you have a truly non-imperative language such as LISP or other functional language, you would branch off the grammar representation containing the lookahead analysis: NFA+DFA stage.

ParseLang would look like the AST for a high-level language. The reason I don't go straight to NIL or Java for example is that ANTLR has to break apart some alternative lists due to lookahead and to insert tree actions. There should be an easy way to dump this from the internal grammar NFA representation. From there, it's a simple matter of translating to a imperative-language neutral form and then to final instruction selection.

ParseLang would also be a convenient place to interpret ANTLR for prototyping.

As an example, consider

s : a[42] | Z ;
a[int i] : A B | C ;

The ParseLang tree might be

RULE
 |
 s -- ARGS -- SELECT
                |
               CASE ---------------------------- CASE
                |                                 |
               SET{A,C} -- SLIST                TOKEN(Z) -- SLIST
                             |                                |
                           INVOKE                           MATCH
                             |                                |
                             a -- ARGS                        Z
                                   |
                                   INT(42)

Note that there would be no infrastrutcure variable definition like AST pointers and labels etc... That would only appear in an actual output language or in the NIL format.

From ParseLang you would translate the SELECT to a SWITCH in a NIL format etc... Using translation templates, you could even go straight to a programming language like this:

#(SELECT #(CASE expr slist)) ==> "switch (LA(1)) <cases:{"case <expr> : <slist>; break;"> }"

#(INVOKE ID #(ARGS (arg)+)) => "<ID>(<arg>);"

Or, more likely, these output templates could be used as a convenient way to generate NIL trees by running the NIL parser over the output templates. This lets you use text as humans like to read as the output of a translation phase while still getting out a trees. :) Cute. From the NIL tree you could translate to any imperative language easily.

I'm getting more and more fond of the notion that ANTLR will have a simple standard imperative language as it's action language format. This way all grammars are portable to all languages even with symbol table actions etc.... :) Woohoo!

November 15, 2003

Here is an upcoming paper by Bryan Ford to a conference that formally defines grammars (PEGs he calls them) that operate like ANTLR: where alternatives have an implied order. Nicely formalizes the strategy of predicated parsing etc...

http://www.brynosaurus.com//pub/lang/peg.pdf

Just as we have the syntactic predicate that says what the lookahead context must be, he has a "not predicate" that says what the lookahead cannot be. This is very useful for such things as saying "match this rule, but only if not following in context by this next thing." Really powerful. moreover, he has a parser that demonstrates linear backtracking parsing by use of dynamic programming. I will be studying his technique very closely before code gen for ANTLR. :)

October 27, 2003

One of the nice things about antlr is that there are multiple back ends. The problem is that you can't easily retarget a grammar if it has embedded actions. Since I'm going to have an intermediate "langauge" for auto-generation stuff like "build me some trees", perhaps I should make it visible to the users? I expected a simple intermediate thing that was of the form:

template(arguments)

like

declare(name="t", type="int")
assign(lhs="t", rhs="3+4)
functioncall(name="foo", args="t")

meaning (in C, C++, Java, C#):

int t;
t = 3+4;
foo(t);

and then the instruction selector would simply find the assign template for your language and dump the output like a sophisticated macro preprocessor.

Now, consider that translating the actions of languages is pretty easy; it's the declarations that are hard. Well, what I mean is that looking at the actions in grammars I see simple things like IF, FOR, function call, assignment etc... What if I made a generic little language for these actions that could do most stuff you want assuming you do the declarations properly in your language outside the grammar? I mean, it's a drag not being able to reuse a grammar for C++ output just because it has an assignment action in C# in there. Whatever.

I would really like to be able to have a grammar that built symbol table information, for example, that could be reused as-is in C++, C, Java, etc... It's just some pops, pushes, and function calls, right? ;)

October 12, 2003

My first antlr code-gen experiment working with templates. I.e., how to build output generators for ANTLR.

I added a few actions to my antlr AST tree walker in order to make it create templates and fill them with data. The result is complete rule (or set of rules) in Java. :) I'm honing my understanding of what should be where in terms of MVC. I really think that we can do C, C++, Java, C# by mostly modifying some templates sitting on the disk. StringTemplate knows how to inherit templates at this point so you could just subclass Java's templates dir to implement C# as it differs. Heh heh heh.

This only works for basic rule ref and token ref stuff and I faked the analysis just to test the templates. Verrrrry interesting.

I'm officially back working on ANTLR 3. What do you think of the BLOG idea for ANTLR 3? Could someone with more time on their hands (read that the javacc guys) beat us to the punch with all of these cool ideas?

General stuff

Goal: make it as easy as possible to generate a new code-generator. To be able to do this w/o Java knowledge if possible; i.e., use templates. This may require an intermediate serialized form people can read in.

General idea:

Provide templates for common abstractions like program, procedure, assignment, switch, choice, add-child, create-node, and so on.

Then, have a controller map AST construct to the appropriate StringTemplate just like PageDispatcher maps URLs to page body templates. Like the sample I did in OR, the tree grammar sometimes maps to a template (for "concepts" rather than intermediate rules) and sometimes just fills things in in the right order.

Controller will have to publish a list of attributes it will fill in so that overall program templates can reference them like:

program ::= <<
<includes>
<declarations>
<functions>
>>

Controller can be rewritten for really bizarre (i.e., non algol-derivatives) or overridden. Each controller defines a set of templates it knows how to use and a set of attributes per template. May want to extend StringTemplate to allow group of templates in one file.

LL-regular machine:

They will have to code the LL-regular state machine natively however. Takes machine, token buffer as input and yields an integer alternative number or -1 if nothing matched.

SAMPLES

a : A | c D | a=f() | lab:Q ;

A     -> "match(<attr>);"
c     -> "<attr.name>(this, < attr.args>);" # have to pass in "this" for C
a=f() -> "<lhs> = <rule:ruleRef()>;"
lab:Q -> "<label> = LT(1); <token:tokenRef()>;"

rule def ::= <<
<attr.visibility> <attr.returnType> <attr.name>(Parser this, <attr.args>) {
  <labelDefinitions>
  <block>
}
>>
LL1block ::= <<  # somehow alts[i] is object with {label, alt} properties
switch ( LA(1) ) {
  <alts:blockCase()>
}
>>

blockCase ::= <<
case <attr.label> : <attr.alt:alt()>
>>

Can they define a Java object somehow to handle the truly bizarre cases? A mapping to that instead of a template, that is?

Interesting...must fill bottom up. Walk top-down until you hit the leaves and then start with the token refs etc... to build up more and more complicated structures. This implies that you don't really need to apply templates to attributes...that is trying to make the template walk. Make the controller fill in. Use template refs for formatting issues not walking downwards in the tree. In fact, it can't: you are not passing in the tree. Must separate the data types, structure of the AST from the template which is used to just spit code.

So then the result of each conceptual chunk like atom is a StringTemplate. Same for alt:

alt returns [StringTemplate st]
{
	st = new StringTemplate("<elements; separator=\"\n\">");
	StringTemplate a = null;
}
	:	#( ALT ( a=atom {st.setAttribute("elements", a);} )* )
	;

Then moving upwards:

block returns [StringTemplate st]
{
	StringTemplate ll1 =
		new StringTemplate("switch (LA(1)) {"+
			<alts>
		"}"
		);
	StringTemplate switchItem = "<labels:{case <attr> :}> <alt> break;";
	StringTemplate llk = new StringTemplate("");
	StringTemplate a = null;
}
	:	#( BLOCK
		   ( a=alt
		     {
		     StringTemplate aST = switchItem.newInstance();
		     aST.setAttribute("labels", #alt.getLL1Labels());
		     aST.setAttribute("alt", a);
		     ll1.setAttribute("alts", aST);
		     }
		   )+
		)
	;

Then have to move template literals into file(s) so you can change the code-generator.

LL1_block ::= <<
switch (LA(1)) {
	<alts>
}
>>
LL1_alt(labels+, alt?) ::= <<
<labels:{case <attr> :}> <alt> break;
>>

Questions:

How to handle generating include files and others? With a second pass?