Home | Download | ANTLRWorks | Wiki | About ANTLR | Feedback | Support | Bugs | v2


Latest version is 3.0.1
Download now! »

Download
» Home
» Download
» ANTLRWorks
» News
»Using ANTLR
» Documentation
» FAQ
» Articles
» Grammars
» File Sharing
» Runtime API
» Tech Support
» Bug Tracking
»About ANTLR
» What is ANTLR
» Why use ANTLR
» Showcase
» Testimonials
» Getting Started
» Software License
» ANTLR WebLogs
» ANTLR Workshops
»StringTemplate
»ANTLR v2
»Feedback
»Credits
»Contact


Support StringTemplate, ANTLR Project by making a donation! Terence often pays for things like the antlr.org server, conference travel, and this site design (that alone cost US$1000). Buy him a beer and pizza remotely ;)

Search



ANTLR 3 Lexers, Parsing Integration

RSS Feed

November 6, 2004

Had a long conversation with John Mitchell today and firmed up a very very clean answer to all my problems. John suggested a very nice filtering scheme too that obviates the need to do skip() or any of that cruft in the lexer.

Simplified lexers

After reviewing a bunch of lexers, it seems that much of the fancy functionality like '!' operator and such are not used. I am stripping down the lexers so that they have notions of:

  • a token or multiple tokens generated by each rule
  • line information settable only via newline()
  • column info is now "char position in line" info; no more messing with tabs
  • all input is buffered by default
  • tokens by default have start/stop indices into the char buffer rather than a bunch of String objects.
  • tokens have a channel number that the parser can "tune" to. You can set the channel by "channel=WHITESPACE;" or whatever.
  • tokens have line, char position info by default
  • the parser grammar (or combined grammar) can specify the extra fields for a token, which results in a grammar specific token. Tokens may also have a generate attributes table for dynamically setting attributes, thus, avoiding creation of a million token subclasses.
  • emit(...) overridden in a lexer subclass to change what token is created.
  • you cannot set the text of a token; you can only create a token and emit it.
  • I'm reducing the "modal" nature of the lexer so token type and channel info are local variables, carried into a token via an automatically generated emit(makeToken(...)); call at the end of each lexer rule. If you don't like the overhead, change the codegen templates. :)

So in lexer actions you can do this:

emit(some-token);
type=ID;
channel=WHITESPACE;
newline();

and that's about it. No parsing of $setType type stuff.

Token buffers and lookahead

So, the lexer always creates tokens for comments and whitespace, which is the commonly needed functionality. The parser simply ignores any tokens whose channel is not the comments/whitespace channel. It may cost a wee bit in the lookahead operation but not as much as you'd expect due to the left-factoring implicitly done by DFA decision making.

The parser will incorporate TokenStreamRewriteEngine functionality so that you can always alter the input stream to do easy rewrites. That functionality will go a long way.

AST creation will ultimately just point into the token buffer rather than copy text and so on.

November 5, 2004

Literals go into DFA, parser-lexer token type communication

After discussions with the usual suspects, particularly with Loring Craymer bugging me, I'm convinced that we should not have a literals table like ANTLR 2.x had. This was a liability due to weak LL-based lexers (LL lexers shine elsewhere). Part of what tipped me over the edge towards literals in DFAs, was my generation of bytecodes. DFAs in java via static objects is just too huge. By my estimates we can get about 2000 states into a single method, hence, single DFA. There are perhaps <100 keywords in just about every language, which should fit just fine. We could get maybe 250 keywords into a single DFA.

At first, I thought that this simplifies the interaction of lexer and parser. The parser grammar could reference anything in double quotes--the string gets would get converted to a rule in the lexer. For example,

grammar P;

a : A "abc";

would yield a P.tokens vocabulary interchange file:

A=1
"abc"=2

Then, a lexer grammar could pull in this vocab:

lexer grammar T(tokenVocab=P);

A : 'a' ;

The generated lexer would behave as if you had typed:

lexer grammar T;

A : 'a' ;
TOKEN2: "abc";

Every literal in the parser would get shoved into the lexer as a separate rule.

BUT, I've just decided I don't like this given the discussion of integrated specs below. When you build a lexer, the grammar would dictate completely the rules. The incoming vocab would merely set the token types so a parser and lexer could sync up. I don't like the idea any more of having a parser be able to force behavior on a lexer (then it's consistent for people that build their own lexer or use, shudder, flex).

Literals in your grammar will not be sent to the lexer. I guess that implies that you could not use "for" in the parser grammar, you'd have to say FOR. That is, unless you use a combined spec. The only means of communication between parser and lexer vocabulary-wise is token types. Clean and simple. I like it. I guess this will encourage the use of combined specs. Hmm...they could get big, but PCCTS seemed ok with combined spec.

Integrated parser, lexer specification

With the power of recursion and real LL-based lexing (instead of a DFA based approach), ANTLR 2.0 lost some convenience of specification. It was harder to just list a bunch of lexer rules and have ANTLR know how to separate them (e.g., INT vs FLOAT). You had to do lots of left-factoring. Ick. Anyway, this made it more difficult to reproduce the single specification PCCTS (ANTLR 1.0) introduced so ANTLR 2 dropped the concept.

ANTLR 3.0 is a hybrid approach. It uses LL(*), which is LL plus a full DFA decision-making algorithm that can see over arbitrary left-prefixes common among more than one rule. We get the best of both worlds. With this increased power, one should simply be able to list a bunch of lexer rules and have ANTLR separate them while still providing recursion, readable code, and other LL-based lexer advantages. If we can just list rules and have it work, we should be able to merge the lexer/parser spec again, at least for many languages. Naturally, from the single spec, I would just automatically create the lexer grammar for the user. For example, one might code the following combined grammar:

grammar P;

tokens {
	EQ="=";
}

s    : ID "=" expr ";" ;

expr : INT | FLOAT ;

ID   : ('a'..'z')+ ;

INT  : ('0'..'9')+ ;

FLOAT: INT '.' INT ;

ANTLR would automatically generate the lexer grammar:

lexer grammar PLexer;

EQ : "=" ;

TOKEN3 : ";" ;

ID   : ('a'..'z')+ ;

INT  : ('0'..'9')+ ;

FLOAT: INT '.' INT ;

It saves you the trouble of making the separate file and possibly mismatching the literals referenced in the parser and defined in the lexer. (Actually, ANTLR wouldn't even have to generate a file, it can build a text string and then invoke {Grammar}'s constructor to suck it in. Cool, I never did like the {parser.dlg} intermediate file from PCCTS days.)

More importantly we can provide error messages now if you reference a token that has no lexical definition; Ric Klaren will jump for joy!

How does this automatically-generated lexer file affect debugging? I think that it would be fine to let programmers debug the code generated specifically for the lexer (even though they did not create that lexer file specifically). As far as error messages, they would all point into the combined spec the user wrote.

Does a combined spec remove the need for the tokens section? Nope. That item is used to introduce imaginary tokens and associate tree node data types with tokens etc...

"Protected" lexer rules

We would still need lexical rules that don't yield tokens--they are just referenced from other lexer rules. ANTLR 2 uses the protected keyword, which is hideous. I'd want to be able to do this still:

INT  : (DIGIT)+ ;
DIGIT: '0'..'9' ;

Heh, we don't need a keyword! If the token rule, DIGIT here, is not referenced from within the parser rules, it's not a token! Heh, that's awesome! Hmm...what to do when the lexer is in a separate file? Oh, if DIGIT is not in the token vocabulary import file, then it's not visible outside and is hence not a real token rule. Heh, I like this! Woohoo!

Lexer rule return values

I'm thinking that each method generated for a lexer rule should return a token. My only hesitation is for rules that are not real tokens. It would be expensive to compute/create a token object for each reference to, say, DIGIT. I do want the text matched for any rule, though. Perhaps, each rule returns an index into a text buffer indicating what text was matched for that rule. No, would need a range; would have to create a range object. Just as bad. I don't like the way ANTLR 2 does it--only create a token if somebody references the protected rule. Ah! I know, let's just leave it up to the programmer to specify what a rule returns if it's a non-token rule. For example, you could define DIGIT to return a char.

Rules like INT that are real tokens would implicitly return a token?

Setting token type, text, token

If one needs to alter the token type in a rule, we've been using $setType etc..., which is a drag because then ANTLR needs to parse lexical actions and look for $setType. Perhaps we should simply mandate that each language target should do what is natural in that language. For example, Java/C++ could do setType and C could do set_type. The think I don't like this because a method may not be what you want. You might want to set a local variable called ttype as is done now, which is used to create the token for the rule.

INT : ('0'..'9')+ ('L' {set return type to be LONG not INT})? ;

Also, an invocation of nextToken could actually generate more than one token; sometimes you need to insert extra tokens (though that is rare). Perhaps we need an emit() method that just adds a token to an output queue. Then the user can compute whatever tokens they want and do an emit. But using the emit, there would be no "global" instance variables for token type, token etc... Methods for lexers wouldn't return anything then by default--they just compute and add tokens (or none at all).

By default, each token rule would call emit at the end of the rule? Nope. We'd get duplicate emit instructions if the user put one inside a lexical action and I don't want to have to go find all the emits in the actions. I suppose you could check for empty token object emission programmatically by checking if the output queue had changed. That would work: if you don't emit a token, one will be emitted for you based about the text matched for the rule and token type of the enclosing rule. If you have one token rule call another token rule, the invoking rule would know still that a token had already been emitted and would skip the automatic emission I guess.

Ok, back to setting just the token type not the token. Perhaps we mandate that a local variable exists called ttype and you just set it directly in any target language syntax.

INT : ('0'..'9')+ ('L' {ttype=LONG;})? ;

The INT method would start by setting ttype=INT; and then you could "override" it by setting ttype manually. If you do not emit a token object manually also, then the INT rule would emit a new token object based upon ttype and the text matched for that rule. Well, that is, unless ttype is the "skip" token type.

Being efficient with text / tokens

Keywords and operators are very common tokens for just about any input so it seems a waste to recreate a new token and string for them each time. The only problem is that you would have screwed up line number information, but for keywords that may not be too much of a problem. Anyway, you can manually hack this:

FOR : "for" {emit(staticFORTokenObject);}
    ;

Normally, a lexer has a text buffer (StringBuffer in Java) that builds up the text for a token. Then a Token object is created and a String is created from the text buffer and stuffed into the Token. Most token text could be more efficiently specified with two indices that indicate the start/stop index in the incoming char stream. This saves building String objects and doing double char copies (once from input buffer to lexer text buffer and then again into a String object). Much faster.

The only time you need a text buffer is when you are not getting an exact copy of the input buffer such as n converted to newline char or when you want to strip the quote marks from the text of a string. Sometimes you want to totally rewrite the entire input text with something else. So, we need a hybrid approach where we can have Token objects that use indices and Token objects that have an actual string in them. A simple subclass will work. What is the signal in the lexer? Hmm...perhaps if I see any rule with a '!' (don't include text) suffix, I use the internal buffer. Ack. No, the '!' doesn't seem that useful after looking at a bunch of grammars. Better to make people create their own token if the text is different.

Skipping tokens

How do you say "skip"? Currently we set the token type to a special value Token.SKIP. I've always hated this non-explicit skipping mechanism. Perhaps if you don't emit anything then nextToken would look for another. Actually, that is the signal I just said we'd use for auto-emission of a token.

Damn, perhaps a special token type is the only reasonable answer. Actually, a skip() method will work. It can set a boolean that the nextToken() method uses to restart. Nope...see the Nov 6 entry above.

Filtering

How do we do that "filter=true" and "filter=aRule" functionality like ANTLR 2? Ah. Easy. When the predictor DFA for all the lexer rule returns an error, just invoke the filter rule. If you have set filter=true rather than a rule, it defaults to a rule with just '.' wildcard in it. A char is consumed and the nextToken() rule tries again. Token objects are not created for filtered text. If you don't want a token created for each whitespace token, for example, you can put the whitespace rule into the filter.

Single combined parser-lexer recognizer object?

One could argue that a single combined parser/lexer object should be generated instead of separate objects, but they are really two separate recognizers. It is also useful to insert a token stream filter between the lexer and parser. I suppose one could send in a filter to the combined object; the object could just insert the filter between the parser and the TokenSource (lexer). Hmm... in general it just seems kind of messy to combine recognizres due to conflicting definitions of "match" for token and character in the same object.