Stuff I do not like in ANTLR 2

Updated March 10, 2004

In no particular order, here is a list of stuff that I don't like in ANTLR 2 or that must be fixed for ANTLR 3, including stuff about the site, support, build, and debugging. My comments are followed by the comments of others, in arrival order. This is not a complete list, just the stuff that annoys the hell out of me. Please send (parrt at antlr.org) your complaints. :)

Thanks to all for the encouragement and for caring enough to complain! ;)

Terence Parr spake:

  1. no unit/functional tests
  2. the build process
  3. no separation of runtime vs tool
  4. no way to get the text or tokens matched for a parser rule.
  5. parse trees are sometimes useful. Automatic generation would be nice. My new derivation sequence generation should be automatic as well.
  6. EOF must be handled properly in lexers (i.e., a virtual character)
  7. Lexers are pretty damn slow
  8. Labels are unique to a rule, which was done for exception handling. It's annoying though because you can't do this (a:ID|a:TYPE) to have 'a' set to whatever was matched. Now you have to set two different labels and then figure out which was set. Ick. Sriram Srinivasan brought this to my attention again.
  9. Approx lookahead vs full LL(k) bites me about once a grammar.
  10. I hate having to left-factor my lexical rules (though I love the LL(k) nature of them). Python would have been MUCH more difficult w/o full LL(k) lexers I think.
  11. I hate not having a good way to resolve nondeterminisms.
  12. I want an easy way to start a new grammar (wizard or otherwise).
  13. I don't think I like the inheritance model for reusing grammars (not to say you don't want inheritance; you want it at the implementation level not the grammatical level).
  14. I hate how hard it is to build a code generator.
  15. I don't really like how expressions work. A precedence parser would be much better. Hard to explain to people as it is and it's slow.
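
For readers unfamiliar with the precedence-parser idea in item 15, here is a minimal sketch of precedence climbing in Java. It is not ANTLR code and not a proposal for how ANTLR 3 should do it. The class name `PrecedenceClimbing`, the operator table, and the choice to evaluate integers directly instead of building a tree are all assumptions made for illustration.

```java
import java.util.Map;

/** Illustrative precedence-climbing evaluator for integer expressions. */
final class PrecedenceClimbing {
    // Hypothetical operator table: a higher number binds tighter; all operators are left-associative.
    private static final Map<Character, Integer> PRECEDENCE =
            Map.of('+', 1, '-', 1, '*', 2, '/', 2);

    private final String input;
    private int pos = 0;

    private PrecedenceClimbing(String text) {
        this.input = text.replace(" ", "");
    }

    static int evaluate(String text) {
        PrecedenceClimbing parser = new PrecedenceClimbing(text);
        int value = parser.expression(1);
        if (parser.pos != parser.input.length()) {
            throw new IllegalArgumentException("unexpected input at offset " + parser.pos);
        }
        return value;
    }

    // One loop replaces the usual chain of one grammar rule per precedence level.
    private int expression(int minPrecedence) {
        int left = primary();
        while (pos < input.length()) {
            char op = input.charAt(pos);
            Integer precedence = PRECEDENCE.get(op);
            if (precedence == null || precedence < minPrecedence) {
                break; // not an operator, or it binds too loosely for this level
            }
            pos++;
            // precedence + 1 makes the operator left-associative.
            int right = expression(precedence + 1);
            left = apply(op, left, right);
        }
        return left;
    }

    private int primary() {
        if (pos < input.length() && input.charAt(pos) == '(') {
            pos++;
            int value = expression(1);
            if (pos >= input.length() || input.charAt(pos) != ')') {
                throw new IllegalArgumentException("')' expected at offset " + pos);
            }
            pos++;
            return value;
        }
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("number expected at offset " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    private static int apply(char op, int left, int right) {
        switch (op) {
            case '+': return left + right;
            case '-': return left - right;
            case '*': return left * right;
            default:  return left / right; // '/'
        }
    }
}
```

Adding an operator in this style means adding one table entry, where the rule-per-level style needs a new rule.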

John Mitchell

Lack of predicate hoisting.

Matthew Ford

His thoughts are long enough to warrant a separate page: Tree building proposal

Monty Zukowski

I would like a better way to manage all the warnings. I don't particularly like the warnWhenAmbig=false flag. For one it doesn't work for all the cases I want it for.

I think a companion tool could be better at annotating error messages and presenting new ones/hiding old ones, etc.

An ANTLR Cookbook would be quite handy too, and could be built on top of the existing examples, and my C grammar :)

Of course there's a slew of tree things but Loring's got that covered from the design cabal.

Allow syntactic predicate to break out of a loop.

Interactive parsers are not so fun with ANTLR 2.

I would like to be able to trace a Token all the way back to file offsets so I could modify files in place, not have to regenerate them entirely. That's not so easy with Unicode.

I also hate dealing with ambiguous keywords!

Chris Daly

  1. (I thought of this last but I'm adding it at the beginning because it's so important to me!) Those of us wanting to use Antlr within a corporate environment have to do something to make the lawyers happy. I think the single most important thing you can do here is to have a mechanism for registering contributors. Each contributor should be reachable (email is fine but the more contact info you have the better) and should have made some kind of affirmation that they agree with the license (this affirmation could be an email that you print and save, but a signed form faxed or snail-mailed would be even better).

I suggest looking at some of the bigger open source projects like Mozilla or Eclipse or Apache to see what mechanisms and forms they use.

Lawyers would be even happier if the contributors all assign their copyrights to one person (i.e. Terence) or entity (like U of S.F.) but I don't think this is necessary as long as all of the contributors are contactable (and there aren't so many that it becomes extremely difficult to contact them all).

Beyond that, you mentioned before that you are considering BSD as the license. That would work for me. GPL or LGPL would totally disallow me from using Antlr. CPL (the license Eclipse uses) would be ideal for me but BSD is very doable.

Here are the more technical suggestions:

  1. Ignore options that Antlr doesn't care about. Warning about unknown options is ok, but don't just bail out. I have some cases where I am parsing the .g files to generate some code that works with the generated parser. So I want to be able to define my own options that my tool looks at but Antlr ignores.
  2. never call System.exit()! Throw an exception instead.
  3. I second your #6, "Labels are unique to a Rule". I would also recommend replacing the name:TOKEN syntax with name=TOKEN. Using the colon in that context can cause some head-scratching bugs. Look at the rule(s) below and see what can happen now when you forget the semicolon at the end of a rule.
    rule1 : a b
    rule2 : X;
    
  4. When a parser refers to an undefined token complain about it. For example the following test case compiles without error even though token DASH is not defined anywhere:
    class TestParser extends Parser;
    sos : s o s;
    s : DOT DOT DOT;
    o : DASH DASH DASH;
    
    class TestLexer extends Lexer;
    DOT : '.';
    
    TJP: it turns out this is often what you want, but not always. Ric had this turned on for ANTLR and it made my java grammar complain like mad. All of those actions in a lexer that say "$setType(FLOAT);" in rules like NUMORINT will result in errors.
  5. I'd like to see a better system for managing parsers and lexers in separate files. I don't have any specific ideas here, just the feeling that the TokenTypes files introduce an unnecessary extra level of confusion.
  6. I want to second what Monty said:
      I would like to be able to trace a Token all the way back
      to file offsets so I could modify files in place, not have
      to regenerate them entirely. That's not so easy with Unicode.
    

I think Token offset is a more fundamental concept than line/column. If the latter are available then the former should be too.
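
A rough sketch of why offsets are the more fundamental quantity: line and column can be derived from an offset plus a table of line starts, and an offset range is all an in-place edit needs. The classes `OffsetToken` and `SourceBuffer` below are hypothetical and not part of any ANTLR runtime. Offsets here are Java `char` indices into an already decoded string, so the sketch sidesteps the byte-versus-character problem that makes Unicode files harder.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical token that remembers where it came from: [start, stop) in chars. */
final class OffsetToken {
    final int type;
    final int start;
    final int stop;

    OffsetToken(int type, int start, int stop) {
        this.type = type;
        this.start = start;
        this.stop = stop;
    }
}

/** Hypothetical source holder that maps offsets to line/column and edits by offset. */
final class SourceBuffer {
    private final String text;
    private final List<Integer> lineStarts = new ArrayList<>();

    SourceBuffer(String text) {
        this.text = text;
        lineStarts.add(0);
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                lineStarts.add(i + 1); // the next line begins after the newline
            }
        }
    }

    /** 1-based line number containing the given offset. */
    int lineOf(int offset) {
        int index = Collections.binarySearch(lineStarts, offset);
        if (index < 0) {
            index = -index - 2; // last line start that is <= offset
        }
        return index + 1;
    }

    /** 1-based column of the given offset within its line. */
    int columnOf(int offset) {
        return offset - lineStarts.get(lineOf(offset) - 1) + 1;
    }

    /** Returns the text with only the token's range replaced; everything else is untouched. */
    String replace(OffsetToken token, String replacement) {
        return text.substring(0, token.start) + replacement + text.substring(token.stop);
    }
}
```

For example, with the text `"int x;\nint foo;"` a token covering `foo` has offsets 11 to 14, which the buffer reports as line 2, column 5.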

  7. Some way of using a literal without adding it to the literals table. I always end up with rules enumerating the keywords that are also legal identifiers, like:
    id: ID | "foo" | "bar";
    
    There could be some syntax like "foo"# that means don't (or do) add this to the table.

Ric Klaren

  • I seriously dislike having to dequote java strings in the codegenerator and checking them for sanity (multibyte sequences) and then repackaging them again. I'd rather see the lexer supply int arrays or something similar so I don't need to worry about quoting other than during writing stuff out.
  • Also the lexer should check the charactervocabulary for sanity (no values 0, 1, 2, -1), and all string/character literals should be checked before passing them to the codegenerators.
  • Loring reminded me just now in a post: I want to be able to reference EOF in the lexer as a 'normal' token. uponEOF is a kludge.
  • ANTLR should stay a commandline tool (the core functionality for grammar inheritance or whatever it should replace I don't care) if it doesn't run from a Makefile I'm not interested ;)
  • More documentation in the code.
  • The lack of warnings/errors for incorrect use of options/constructs
  • Not being able to specify a default template errorhandler for the whole grammar.

Internal things:

  • Internally I'd like a better interface between codegenerator and what is now the action parser. (if that's still an issue with the new codegen/syntax) The near heuristics now used to do the right thing to translate a #treethingy into something sane is a horror.
  • Codegen wise I'd like to know more things before I start generating a piece of code, so I can cut down on declarations etc. Stuff that does not need to be constructed does not cost cpu cycles.
  • I dislike the 'all-over-the-place' system used for grammar/file options. I'd prefer having the included codegenerators register commandline options and handlers in the maintool (and get rid of all the globals for them).
  • Clear semantics for things like:
      ( { x < 4 }? myRule )*
      ( { x < 4 }? myRule )+
    


  • Clear documented semantics for the scope of a variable defined in actions:
        (  { int somevar; }
    
        )*
    
    Or:
        (  { int somevar; } :
    
        )*
    
    Or dirty stuff like:
        (  { if( i > 10 ) break;  } :
          someRule
        )*
    
        (  { if( i > 10 ) throw ....;  } :
          someRule
        )*
    
    Or a statement like 'You're on your own!'.

  • Consistent importVocab/exportVocab behaviour when the lexer/parser are in the same file and in the separate files cases.
  • it would be nice if ANTLR would examine dependent grammars and remake them if needed. Would probably require some extra options for specifying the search path for grammars/vocabs. (we definitely need more control for this as someone else pointed out as well)
  • It would be nice if ANTLR3 was designed from the outset for heterogeneous AST and Token support. In antlr2 this was not the case and it shows.
  • I would love a shorthand syntax for:
      ( stuff ) ( DELIMITER stuff )*
    
    Where delimiter is usually a single token or a set of tokens. The advantage is that the action code for stuff can be the same. In the current implementation you have to keep some near identical bits of code synchronized with the occasional copy paste error resulting from that. ell used a syntax like ( stuff || DELIMITER )* or something along those lines.
  • It would be nice if antlr would warn for common mistakes like rule* instead of ( rule )*

Pete Forman

I still have a feature request. Might we have a "-i" command line option to specify a directory other than the current one to locate the import vocab file. This would complement "-o" for those such as myself who keep source and generated/object files in separate directories.

Anthony Youngman

I think that gcc allows you to specifically disable certain warnings--along the lines of

warn everything except warning number 82

I suspect that when I get things working the way I want, my screen will explode in ambiguity warnings! The problem is that there might be one warning I really need to see, and I miss it in the screeds of ambiguity crap I can't suppress.

I'd suggest that you allocate numbers to all your warnings (you probably do that already :-) and then print your warnings via a call to a central routine. Then have some way of telling that routine that, if it's called for warning(s) x, it should return without doing anything.
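
The central routine described above could look roughly like the following Java sketch. The class `WarningReporter`, the message format, and the warning numbers are made up for illustration; ANTLR 2 does not number its warnings this way.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical single choke point through which every warning is printed. */
final class WarningReporter {
    private final Set<Integer> suppressed = new HashSet<>();
    private final List<String> emitted = new ArrayList<>();

    /** Ask the reporter to stay silent for the given warning number. */
    void suppress(int warningNumber) {
        suppressed.add(warningNumber);
    }

    /** Prints the warning unless its number is suppressed; returns true if it was printed. */
    boolean warn(int warningNumber, String message) {
        if (suppressed.contains(warningNumber)) {
            return false; // return without doing anything
        }
        String line = "warning " + warningNumber + ": " + message;
        emitted.add(line);
        System.err.println(line);
        return true;
    }

    /** Warnings actually printed so far, in order. */
    List<String> emitted() {
        return emitted;
    }
}
```

With something like this in place, a command-line switch would only need to call `suppress` once per number before the grammar analysis starts.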

Mike Tiller

I'd like the C++ runtime to use data structures that handle heterogeneous tree construction better. I gather the current approach is based on trying to mirror the Java side of things (at least that is what I recall Ric saying). I think a more C++ish design is necessary. I used heterogeneous trees in my project and I don't regret it (in the sense that I strongly prefer heterogeneous trees), but it sure was a pain to work through all the inheritance, reference counting, and type casting issues.

Robin Debreuil

One thing I find strange is using exceptions for flow control in the generated code. I prefer exceptions to just be used for things you aren't expecting in code, probably that is just a style thing though. Just this way they seem to be much like gotos with global variables for state...

For the C# version, it would be really nice to use Enums for all the tokens, or better yet, categories of them. It makes debugging much easier, and the whole thing becomes a bit more 'solid'.

Above all though, it would be great to have more error information available, both in the grammars and when running the generated code. The program itself is designed to facilitate building that kind of thing into languages, so it seems kind of like the cobbler's kids going without shoes. Maybe restrict what is valid syntax in a grammar and catch more common errors. For the generated code, maybe a debug version, where it can tell you things like the statement that it couldn't get past etc. Maybe even things like setting breakpoints on input files. Probably that would be hard, but with all the guessing levels, gotos, exceptions, etc., it can get pretty hard to trace. I may be overlooking a few existing techniques here though, I'm pretty new to it.

Brian Smith

All the runtime error messages that generated lexers/parsers produce should be localizable and just generally easily customizable.
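
One way to picture this is to have the generated code look up a message template by key, rather than concatenating English strings. The sketch below uses `java.text.MessageFormat` from the standard library. The class `ParserMessages`, the key `mismatched.token`, and both templates are invented for illustration and do not exist in the ANTLR runtime.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical message catalog: language code -> (message key -> template). */
final class ParserMessages {
    private static final Map<String, Map<String, String>> CATALOG = new HashMap<>();

    static {
        Map<String, String> en = new HashMap<>();
        en.put("mismatched.token", "line {0}: expecting {1}, found {2}");
        Map<String, String> de = new HashMap<>();
        de.put("mismatched.token", "Zeile {0}: {1} erwartet, {2} gefunden");
        CATALOG.put("en", en);
        CATALOG.put("de", de);
    }

    /** Formats the message for the locale, falling back to English templates. */
    static String format(Locale locale, String key, Object... args) {
        Map<String, String> templates =
                CATALOG.getOrDefault(locale.getLanguage(), CATALOG.get("en"));
        String pattern = templates.getOrDefault(key, key);
        return new MessageFormat(pattern, locale).format(args);
    }
}
```

Customizing a message then means replacing a template, not editing generated code. A real implementation would more likely load the templates from resource bundles than from a hard-coded map.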

Steve Silber

Better msvc integration support. I know we all hate M$, and I'm no exception. But for the love of god, there's a LOT of development going on in MSVC these days, and ignoring the poor saps (like me) who have to use it is not so nice. Basically, we need for ANTLR to be truly platform agnostic, not just *nix agnostic.

To wit:

  • NMAKE-compatible makefiles for the C++ libs. If we can cobble up NMAKE makefiles, then we're cool with Ric: we're purely command-line. I got no problem with that, since if you can make an NMAKE file, you can make a GUI project for it.
  • Has anyone mentioned smoke testing of builds yet? If that's already happening, let's get MSVC as a test target for the C++ libs.
  • A warning-free build for the C++ libs, for all major targets. A Windows build is still littered with warnings all over the place.
  • Let's get multiple error reporting formats. Make them command-line selectable. MSVC mandates a specific format for error output from external commands in order to integrate them fully into the environment (eg. being able to double-click an ANTLR error in the output window and it taking you to the offending line in your grammar file).
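
As a sketch of the last point, selectable error formats could be modelled as an enum with one formatting method per style. The enum `ErrorFormat`, the error code `A0001`, and the exact output shapes are assumptions. The MSVC-style line follows the general `file(line,column) : error CODE: text` convention that the IDE output window recognizes, and the GNU-style line follows the common `file:line:column: error: text` shape. Check both against the tools you target before relying on them.

```java
/** Hypothetical set of output styles for grammar diagnostics. */
enum ErrorFormat {
    /** Roughly what the Visual Studio output window expects for clickable errors. */
    MSVC {
        @Override
        String format(String file, int line, int column, String code, String message) {
            return file + "(" + line + "," + column + ") : error " + code + ": " + message;
        }
    },
    /** Roughly the style used by GNU tools and understood by many editors. */
    GNU {
        @Override
        String format(String file, int line, int column, String code, String message) {
            return file + ":" + line + ":" + column + ": error: " + message + " [" + code + "]";
        }
    };

    abstract String format(String file, int line, int column, String code, String message);
}
```

A command-line switch could then map a format name onto the enum with `ErrorFormat.valueOf(name.toUpperCase())`.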

Sriram Srinivasan

  • I love the idea of using a regex to express a variable amount of lookahead.
  • For ANTLR3, use the built-in collections and java 1.5 generics
  • Need support for associativity and precedence.
  • I don't always understand ambiguity warnings. It would be nice if ANTLR could produce a counter example ("These are possible alternatives in the input which these productions can't disambiguate")
  • Should be able to use the same label:
    m:STATIC | m:PRIVATE | m:PUBLIC ...
    
    Of course, one could do
    {m = LA(1);} (STATIC | PRIVATE | PUBLIC ...)
    
    but I should be able to do it either way
  • Have ANTLR optionally call javac or jikes on the generated code and then fix the LineNumberTable in the class file, so that the lines correspond to the actions in the grammar file. That way, one can use the
  • Native Integration with Idea would be sooo nice --- semantic support (not just code coloring), outlining etc.
  • Better support for keywords and identifiers. I should be able to have standard tokens like "if", "then" and have my own identifiers. If we can give built-in strings more priority, then we don't have to worry about clashes between keywords and identifiers.
  • Token.toString() can use reflection on the generated TokenTypes interface to map its type to the corresponding name. This facility can also be provided as a static method for custom Token classes that don't inherit from CommonToken.
  • Why do the generated methods not return the Token or the result? (That is, why are they declared void?)
  • Since the lexer's performance is usually more critical than the parser, here are some observations:
    • Lexer visits each character at least thrice:
          if (LA(1) == '>' && LA(2) == '>') {
               match(">>");   // It should be sufficient to say advance(2)
      
      then later
           return new Token(... new String(...)) // another copy.
      
    • setText should intern the string. This improves parser performance considerably because one can always do == instead of String.equals. I have attached a StringSet class that you may find of use. It works on String, StringBuffer and char[] keys. The lexer can accumulate characters in a char array, and StringSet.put() will return the corresponding interned String. This way, you need to have only one char[] array in a lexer and only produce new String objects if they didn't exist before. This way, all tokens get very efficient interning (much faster than String.intern) and you produce far fewer objects.
      On that note, I also use a KeywordMap class that accepts char[] and StringBuffer
    • Copying/buffering is necessary only if the source is a stream, not if it is a CharSequence or an array etc.
    • Most generated code can be optimized away. Instead of the following code for the lexer production -- AND: "&"
          public final void mAND(boolean _createToken)
                      throws RecognitionException, CharStreamException,
                             TokenStreamException {
            int _ttype; Token _token=null; int _begin=text.length();
            _ttype = AND;
            int _saveIndex;

            match('&');
            if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
              _token = makeToken(_ttype);
              _token.setText(new String(text.getBuffer(), _begin,
                                        text.length()-_begin));
            }
            _returnToken = _token;
          }
      
      I think we can infer enough from the context of the production to have:
          public final void mAND(boolean _createToken)
                      throws RecognitionException, CharStreamException,
                             TokenStreamException {
            advance(1); // no boundary or error checking here
            _returnToken = makeToken(_ttype, "&");
          }
      
      Note that the string for the operator is automatically interned.
      We need to check for Token.SKIP only if the action code has this string.
      Is _createToken really needed? A production either produces a token or it doesn't, and this information is available at grammar compile time.
  • The barrier to entry for a newcomer is still high. I have some experience with ANTLR, and I too would like a better way to start a new grammar from scratch (compared to copying). Perhaps, we can have a wizard that walks the user through a set of questions and even determines lookahead automatically. The questions could be like these
    • "Which of the following languages does your input most resemble: java, javascript, python, /etc/password, comma separated lists, etc."
    • "Here's a default list of tokens. Make appropriate changes."
    • "Multi-line comment format"
    • "Single-line comment format"
    • "Whitespace"
    • The code for newlines etc. is automatically generated.
  • Or perhaps, the wizard can be "trained" using bunches of sample code
  • Ideas for documents
    • Have HOW-DO-I notes on various topics : associativity, precedence, getting started etc.
    • Have quizzes on the website for different tracks: "beginner, intermediate, expert, Parr". Gives targets for learning ANTLR and parsing concepts.
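
The interning idea from the lexer-performance notes above can be sketched as a small hash set that is probed directly with a `char[]` range, so a `String` is only allocated the first time a given spelling is seen. This is not the attached StringSet class, which is not reproduced on this page. The class name `CharArrayInterner` and the open-addressing layout are assumptions for illustration.

```java
/** Hypothetical interner: look up by char[] range, allocate a String only on a miss. */
final class CharArrayInterner {
    private String[] table = new String[64]; // length is always a power of two
    private int size = 0;

    /** Returns the one shared String equal to buf[start .. start+length). */
    String intern(char[] buf, int start, int length) {
        int mask = table.length - 1;
        int i = hash(buf, start, length) & mask;
        while (table[i] != null) {
            if (matches(table[i], buf, start, length)) {
                return table[i]; // hit: no new object is created
            }
            i = (i + 1) & mask; // linear probing
        }
        String created = new String(buf, start, length);
        table[i] = created;
        if (++size * 2 > table.length) {
            grow();
        }
        return created;
    }

    private void grow() {
        String[] old = table;
        table = new String[old.length * 2];
        int mask = table.length - 1;
        for (String s : old) {
            if (s == null) {
                continue;
            }
            int i = s.hashCode() & mask;
            while (table[i] != null) {
                i = (i + 1) & mask;
            }
            table[i] = s;
        }
    }

    // Same polynomial as String.hashCode, so grow() can rehash from the String alone.
    private static int hash(char[] buf, int start, int length) {
        int h = 0;
        for (int k = 0; k < length; k++) {
            h = 31 * h + buf[start + k];
        }
        return h;
    }

    private static boolean matches(String s, char[] buf, int start, int length) {
        if (s.length() != length) {
            return false;
        }
        for (int k = 0; k < length; k++) {
            if (s.charAt(k) != buf[start + k]) {
                return false;
            }
        }
        return true;
    }
}
```

Because every token text with the same spelling comes back as the same object, a parser using such a table could compare token texts with `==`, which is the payoff described above.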

Tom Moog

TJP: Tom Moog is the super smart guy that has augmented and maintains PCCTS.

Thoughts on code generation and semantics