These are some problems in pccts which may exist in antlr2 also. Follow set caching ------------------ There are problems analyzing expressions of the form: rule: (A1|B1) {A2|B2) ... (A25|B25) because follow sets are cached only for rules. In this case 2**25 alternative would need to be computed. However, if each subrule had its follow set cached, it would require only 50 evaluations. Two kinds of semantic predicates -------------------------------- There are two kinds of semantic predicates needed. One kind is used to creat a "virtual token" which combines token information with a semantic test. This is the classic C typename test. The other kind of semantic predicate is used to control flow and does not depend on context. In short, type 1 semantic predicate always depend onOB lookahead while type 2 semantic predicate never depend on lookahead. Lexical ------- Support for Unicode up to at least 0x10ffff (current xml range). Tolerance for binary zero would be desirable. After all, flex can do it. Epsilon rules ------------- Consider the following rule: r1: A ( B | epsilon )* C ; with input: A C In pccts this results in an ambiguity warning because it can't decide whether whether to enter the epsilon branch or bypass the entire subrule. When there is no action in the epsilon alternative antlr3 should allow it and handle it in the "natural" way - bypassing the subrule Non-traditional semantic predicates ----------------------------------- Reinier van den Born advocated allowing boolean combinations of semantic predicates and lookahead. Here is my interpretation of his idea: A Note on the New "&&" Style Guarded Predicates I've been asked several times, "What is the difference between the old "=>" style guard predicates and the new style "&&" guard predicates, and how do you choose one over the other" ? The main difference is that the "=>" does not apply the predicate if the context guard doesn't match, whereas the && form always does. What is the significance ? If you have a predicate which is not on the "leading edge" it cannot be hoisted. Suppose you need a predicate that looks at LA(2). You must introduce it manually. The classic example is: castExpr : LP typeName RP | .... ; typeName : <>? ID | STRUCT ID ; The problem is that typeName isn't on the leading edge of castExpr, so the predicate isTypeName won't be hoisted into castExpr to help make a decision on which production to choose. The *first* attempt to fix it is this: castExpr : <>? LP typeName RP | .... ; Unfortunately, this won't work because it ignores the problem of STRUCT. The solution is to apply isTypeName() in castExpr if LA(2) is an ID and don't apply it when LA(2) is STRUCT: castExpr : (LP ID)? => <>? LP typeName RP | .... ; In conclusion, the "=>" style guarded predicate is useful when: a. the tokens required for the predicate are not on the leading edge b. there are alternatives in the expression selected by the predicate for which the predicate is inappropriate If (b) were false, then one could use a simple predicate (assuming "-prc on"): castExpr : <>? LP typeName RP | .... ; typeName : <>? ID ; So, when do you use the "&&" style guarded predicate ? The new-style "&&" predicate should always be used with predicate context. The context guard is in ADDITION to the automatically computed context. Thus it useful for predicates which depend on the token type for reasons other than context. The following example is contributed by Reinier van den Born (reinier@xxx.com). +-------------------------------------------------------------------------+ | This grammar has two ways to call functions: | | | | - a "standard" call syntax with parens and comma separated args | | - a shell command like syntax (no parens and spacing separated args) | | | | The former also allows a variable to hold the name of the function, | | the latter can also be used to call external commands. | | | | The grammar (simplified) looks like this: | | | | fun_call : ID "(" { expr ("," expr)* } ")" | | /* ID is function name */ | | | "@" ID "(" { expr ("," expr)* } ")" | | /* ID is var containing fun name */ | | ; | | | | command : ID expr* /* ID is function name */ | | | path expr* /* path is external command name */ | | ; | | | | path : ID /* left out slashes and such */ | | | "@" ID /* ID is environment var */ | | ; | | | | expr : .... | | | "(" expr ")"; | | | | call : fun_call | | | command | | ; | | | | Obviously the call is wildly ambiguous. This is more or less how this | | is to be resolved: | | | | A call begins with an ID or an @ followed by an ID. | | | | If it is an ID and if it is an ext. command name -> command | | if followed by a paren -> fun_call | | otherwise -> command | | | | If it is an @ and if the ID is a var name -> fun_call | | otherwise -> command | | | | One can implement these rules quite neatly using && predicates: | | | | call : ("@" ID)? && <>? fun_call | | | (ID)? && <>? command | | | (ID "(")? fun_call | | | command | | ; | | | | This can be done better, so it is not an ideal example, but it | | conveys the principle. | +-------------------------------------------------------------------------+ Separation of input processing from follow set computation ---------------------------------------------------------- Nice in theory, but not sure it would be used: There are applications in which it would be nice to construct the grammar via an API, call antlr to compute the first/follow sets or whatever, and then use that information with a custom code generator - for instance a virtual machine rather than Java/C++. Alternative to NFA/DFA for regular expressions ---------------------------------------------- I've read recently of something called "regular expression derivatives". There is an article on it by Joe English. binary decision diagrams ------------------------ Another idea suggested by Reinier van den Born is the use of binary decision diagrams (bdd) for combining predicate expressions and lookahead expressions in a uniform way. I read a few survey articles about them. They provide a canonical representation of boolean combinatorial functions. That is, two such functions are identical iff they have the same representation as bdd. What bdd might provide is a way to represent expressions which are more complex expressions than regular expressions (RE). I'm don't know how bdd compare to RE in terms of recognition power. This kind of thing might be useful for handling the very large tree of predicates and lookahead context created by predicte hoisting. Even naive simplication of such expressions resulted in huge reductions in expression size. bit set volume -------------- The volume of memory required for bit sets grows as N**2 because the number of terminals tends to grow as N and the number of tests grows as N. For very large grammars this is a bit of an annoyance (pun). You might consider a technique which would try for greater locality of bit sets so that it for non-pathological cases it would be N**1.5 or N*log(N). An alternative would be run-length encoding for sparse bit sets. The bits sets could be expanded at startup time. I believe JFlex uses this. exceptions for guess mode ------------------------- Using exceptions provides a standard mechanism to unwind the stack and gives the user control via catch blocks. All this with only minor modification to the code generation. Using return codes means you have to support two different code generation strategies - one with return codes for guess grammars which use guess mode and one without return codes for grammars which don't use guess mode. This will result in more alternative in the code generation and more bugs. Set Difference -------------- Is there any way to say: this token matches all the tokens *not* in the FIRST set of some other alternative or not in the union of the FIRST sets of some other alternatives ?