Terence Parr
SED and AWK are great tools bestowed upon us by the great Uncle UNIX. They
have one serious limitation, however: they are line-oriented, so they cannot
handle even simple translation problems on structured files such as HTML,
where a construct may span lines. Consider performing an operation on the
file names in <IMG> tags. The minute a tag spans more than one line, AWK and
SED break down.
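To make the limitation concrete, here is a small Java sketch (not part of ANTLR) that looks for <IMG> file names first line by line and then over the whole input. The class name ImgSrcDemo, the sample input, and the regular expression are assumptions made for illustration; the pattern only covers the simple case where SRC is the first attribute and its value is double-quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcDemo {
    // Illustrative pattern: an IMG tag whose first attribute is a
    // double-quoted SRC. \s+ also matches line breaks, so the tag may
    // span lines when the whole input is searched at once.
    private static final Pattern IMG_SRC =
        Pattern.compile("<IMG\\s+SRC=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Line-oriented search: each line is examined in isolation.
    static List<String> byLine(String input) {
        List<String> found = new ArrayList<>();
        for (String line : input.split("\r\n|\r|\n")) {
            collect(line, found);
        }
        return found;
    }

    // Stream-oriented search: the whole input is examined as one unit.
    static List<String> byStream(String input) {
        List<String> found = new ArrayList<>();
        collect(input, found);
        return found;
    }

    // Appends every captured SRC value found in text to out.
    private static void collect(String text, List<String> out) {
        Matcher m = IMG_SRC.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
    }

    public static void main(String[] args) {
        // The second tag is split across two lines.
        String html = "<IMG SRC=\"a.gif\">\n<IMG\n  SRC=\"b.gif\">\n";
        System.out.println(byLine(html));   // prints [a.gif]
        System.out.println(byStream(html)); // prints [a.gif, b.gif]
    }
}
```

The line-by-line pass misses b.gif because no single line contains both the tag name and the SRC attribute.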
ANTLR 2.5.0 introduced an AWK-like lexical filtering
mode that forces generated lexers to ignore any characters that do not match a lexical
rule exactly. To turn ANTLR into SED, all you have to do is write a lexical filter
rule that prints the characters that match nothing else. It is then up to the
other lexical rules to print whatever output they want.
Consider the following contrived example that turns <br> and <p> tags into
their uppercase equivalents and dumps anything other than those tags to standard output:
    class T extends Lexer;
    options {
        k=2;
        filter=IGNORE;
        charVocabulary = '\3'..'\177';
    }

    P : "<p>" {System.out.print("<P>");};
    BR: "<br>" {System.out.print("<BR>");};

    protected IGNORE
        : ( "\r\n" | '\r' | '\n' )
          {newline(); System.out.println("");}
        | c:. {System.out.print(c);}
        ;

Rather than have a "filter=sed" option, it is simple enough to use this
idiom: put a print statement in a filter rule.
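For readers without ANTLR at hand, the effect of the grammar above can be approximated in plain Java. This is only a sketch of the intended behavior, not the code ANTLR generates: the class name SedLikeFilter and the method filter are hypothetical, the result is returned as a string instead of being printed from rule actions, line breaks are normalized to "\n" rather than the platform separator that println would emit, and the charVocabulary restriction is ignored.

```java
class SedLikeFilter {
    // Mirrors the grammar's rules in order: P, BR, then the IGNORE fallback.
    static String filter(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            if (in.startsWith("<p>", i)) {            // rule P
                out.append("<P>");
                i += 3;
            } else if (in.startsWith("<br>", i)) {    // rule BR
                out.append("<BR>");
                i += 4;
            } else if (in.startsWith("\r\n", i)) {    // IGNORE: two-character line break
                out.append('\n');
                i += 2;
            } else if (in.charAt(i) == '\r' || in.charAt(i) == '\n') {
                out.append('\n');                     // IGNORE: one-character line break
                i += 1;
            } else {                                  // IGNORE: echo any other character
                out.append(in.charAt(i));
                i += 1;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // <b> starts like <br> but does not match it, so it is echoed as-is.
        System.out.print(filter("one<br>two\r\n<p>three <b>bold</b>\n"));
    }
}
```

The final else branch plays the role of the print statement in the filter rule: anything the tag rules do not claim is copied through unchanged.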