1.33-to-2.20 Conversion Tips

I've been converting 1.33 grammars to 2.0 and thought I'd pass along the
following tips that may help folks avoid some of the problems I've had doing
it.  I hope these tips help -- happy parsing!

Tom Nurkkala, PhD
tom.nurkkala@powercerv.com

  • Note that several of the EBNF notations have changed. In particular, the
    optional clause "{...}" has become "(...)?". This new notation for
    optional clauses conflicts with the old way to express syntactic predicates
    (also "(...)?" under 1.33), which have become "(...)=>". Because you'll
    probably have more optional clauses than syntactic predicates, convert the
    optional clauses first, then go back to your old grammar, find the
    syntactic predicates, and change them appropriately in the new grammar.
  • Semantic actions are now delimited with "{...}" rather than the old-style
    "<<...>>" notation. This is an easy replacement to make, as there are
    probably few "<<" or ">>" shift operators in your old C++ code, so you can
    do a simple search-and-replace. Note that you should change optional
    clauses from "{...}" to "(...)?" _before_ changing semantic action
    delimiters, when the old optional clauses are still easy to distinguish from
    the new semantic action delimiters.
  • Probably the most challenging part of the conversion will be moving from
    the DLG-based scanner to the LL(k) scanner. Most of the conversions are
    quite mechanical, but some are not. In particular, you now have to address
    left factoring in those productions of the scanner that will return tokens
    to the parser.
  • ANTLR is happiest when you use quoted strings directly in the grammar for
    keywords. Under 1.33, I had defined all my keywords as lexical tokens
    (using something like "#token K_WORD "keyword""). Although doing this
    avoids misspelling problems (e.g., using "while" in one place and "whiel"
    in another), ANTLR 2.x is best-suited to using literals directly in the
    grammar because of the way it generates the token hash table, etc. in the
    resulting code. Watch carefully for misspellings.
  • There is no #tokenclass in ANTLR 2.x. The best way to handle such cases
    appears to be to create a new production in the _parser_ that mimics the
    old-style token class (e.g., changing "#tokenclass SQLVerbs { K_SELECT,
    K_DELETE, ...}" to something like "sqlVerbs : "select" | "delete" | ...").
  • Handling numeric literals is more problematic in 2.x. In particular, if
    you have a language that has "similar" literals (e.g., integers, reals,
    dates, times, etc. as are present in a database-focused language), you'll
    have more work to do in the LL(k) scanner environment. It appears easiest
    to collect these literals into a single scanner production and either
    left-factor or make use of syntactic predicates. You can set the token type
    in each alternative using a specific semantic action in each disjunct of the
    production (e.g., "{ _ttype = NUM_FLOAT; }"). (Note that if you use
    the -diagnostic switch on antlr.Tool, the scanner's ".txt" file includes
    what seem like spurious complaints about setting _ttype in this manner. The
    warnings can apparently be safely ignored.) See the sample Java grammars
    (particularly Scott's new one) for examples of how to do this type of thing.
  • Use the "protected" flag on lexer rules that are only being used as
    "helpers" (e.g., on a "DIGIT" production that's used in other lexer
    productions for integers, floats, etc.). Not only does this make the
    resulting method in the output protected, it is also used by ANTLR to modify
    its test for ambiguous rules in the scanner, eliminating some
    "non-deterministic" warnings. See examples of this in Scott's new Java
    parser.
  • When generating ASTs, it's often helpful to create "dummy" nodes that
    have a token type that's used only to make AST traversal unambiguous (i.e.,
    "flag" various subtrees so that the tree parser doesn't have to fool with
    resolving ambiguous tree structures). Under 1.3x, such dummy token types
    could be created using #token with no pattern (e.g., "#token D_DUMMY").
    Under 2.x, you can create dummy token types with a production that simply
    has the dummy values as disjuncts (for example, "dummyTokens : D_RED |
    D_GREEN | D_BLUE;"). Such a production will cause the tokens to be created,
    added to the TokenTypes output and so on. You can then refer to the dummy
    types in semantic actions used to build ASTs. Be sure NOT to refer to the
    "dummyTokens" production elsewhere in your grammar!
  • Make use of the "-diagnostic" flag on antlr.Tool. The ".txt" output for
    your parser(s) and scanner(s) is very helpful in diagnosing conflicts and
    ambiguities. Using the txt files in conjunction with the ANTLR output
    itself is the easiest way to figure out which alternatives are conflicting
    with which when there are ambiguities. Note that when the ANTLR output
    refers to "line 0", it's really talking about the "nextToken" function, the
    alternatives for which will appear first in the scanner txt file.
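
The ordering advice in the first two tips (optional clauses before action delimiters, syntactic predicates handled separately) can also be sidestepped by a small script that reads the old grammar once and tracks context as it goes. The following is only a rough sketch, not a tool that ships with ANTLR. The function name convert_delimiters is made up for illustration. The sketch assumes that 1.33 syntactic predicates are written "(...)?", that no action body contains ">>", and that "{" and "}" outside actions and quoted strings always delimit optional clauses.

```python
def convert_delimiters(old_src: str) -> str:
    """Rewrite 1.33-style delimiters to 2.x style in one left-to-right pass.

    Hypothetical helper: it only rewrites delimiters and does not touch
    #token, #tokenclass, or DLG scanner definitions.
    """
    out = []
    i, n = 0, len(old_src)
    while i < n:
        if old_src.startswith("<<", i):
            # Old-style action: copy the body untouched, so braces and ")?"
            # inside the embedded code are never mistaken for grammar notation.
            end = old_src.find(">>", i + 2)  # assumes no ">>" inside the action
            if end == -1:
                raise ValueError("unterminated <<...>> action")
            out.append("{" + old_src[i + 2:end] + "}")
            i = end + 2
        elif old_src[i] == '"':
            # Quoted string: copy verbatim, stepping over backslash escapes.
            j = i + 1
            while j < n and old_src[j] != '"':
                j += 2 if old_src[j] == "\\" else 1
            out.append(old_src[i:j + 1])
            i = j + 1
        elif old_src.startswith(")?", i):
            # In the OLD source, ")?" closes a syntactic predicate.
            out.append(")=>")
            i += 2
        elif old_src[i] == "{":
            # Old optional clause opens ...
            out.append("(")
            i += 1
        elif old_src[i] == "}":
            # ... and closes, becoming the new "(...)?" notation.
            out.append(")?")
            i += 1
        else:
            out.append(old_src[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    old_rule = 'stmt : (decl)? decl | "if" expr stmt { "else" stmt } <<count++;>> ;'
    print(convert_delimiters(old_rule))
    # prints: stmt : (decl)=> decl | "if" expr stmt ( "else" stmt )? {count++;} ;
```

Because the script only ever looks at the old notation, the conflict between old predicates and new optional clauses never arises. The output still needs a careful read, particularly around actions that use shift operators.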