Parser and lexer generators are great tools and there is little reason to knit parsers or lexers (aka tokenizers) yourself.
If your (embedded) target is short on memory, the used tools have to be picked carefully, as not many of them are designed with memory-efficiency in mind.
Want to know more about them and other tools? Read on!
These are the requirements:
- Generated code depends on little or no runtime
- The grammar should be usable for generation to other languages, e.g. Java or C# with little effort
- Memory-overflow conditions should be detected and handled gracefully
- Generated code should use as little RAM as possible, possibly statically allocated
There’s a table of available tools on Wikipedia.
The parser generator: byacc
Byacc (“Berkeley Yacc”) is a really nice Yacc-implementation, which generates an LALR-parser. It doesn’t have all the features that GNU bison (another Yacc variant) offers but it does produce nicer and tighter code. No wonder, as it seems that both bison and byacc have initially been written by the same person (a Robert Corbett), with byacc having been written after bison.
Byacc basically generates a few (ROMable) tables, optional debug-code and the parsing-logic. All of its dynamic allocations happen in one function called “yygrowstack”, which could be easily changed to use static memory only.
Like bison, it can generate a “pure” (reentrant) parser and can be given context, so the same parser can have multiple instances.
One thing isn’t ideal, which is that the initial stack size (YYINITSTACKSIZE) can’t be configured. That’s a pity, because in my case an initial stack size of 32 (vs. the default 500) is enough in all relevant cases. The stack isn’t the execution stack, but the stack where the shift/reduce operations are being performed on. So this isn’t bytes, but the number of tokens on the stack.
Talking about the token stack: You should know that If you’re freeing resources connected to a token (e.g. free() on a pointer in the YYSTYPE-union) in the action code then this code may not be called on parse errors.
To change YYINITSTACKSIZE, byacc’s skeleton.c can be patched or alternatively the generated file can be changed with sed or CMake’s file/string.
This change fulfills Req #4. Stack-overflows are detected and reported (Req #3), the grammar can be used for Yacc-implementations that generate for other languages (Req #2) and the generated code only depends on memset() and realloc().
The lexer generator: Ragel
Ragel is actually more than a lexer generator, it’s a “State Machine Compiler”. Having used it for the first time, I’m sure to return to it more often, it looks great.
Even if abused for “just” generating lexers, the generated code is great. A few (ROMable) tables, the logic code and the action code interweaved and that’s all. No allocations, no dependencies, nothing.
Ragel can also generate for different languages, fulfilling Req #2.
Having them work in tandem
There’s not much more to be said, check out the example I’ve put up at GitHub.
It shows how to use them together and how to have multiple instances of a parser.
That’s it! Like always, questions and comments welcome.
I’ve looked at other tools which I eventually didn’t use. That’s what I found out about them:
Coco has a nice EBNF-syntax and supports many languages (Req #2).
It generates a recursive descent (LL-) parser, which is the kind of parser you would write by hand. Hence, Coco/R generates really nice and readable code with the action-code cleanly interweaved.
Unfortunately, the lexer is unusable (C++ variant, version 2012-01-02) for this case:
- Allocates a lot of heap (60 kB)
- Limited to wchar_t-Strings with its own string-functions (instead of std::basic_string<T>)
- Lots of allocations
- Dependency on wprintf, wcs*
Coco/R supports custom lexers, so I exchanged Coco’s lexer with a Ragel-lexer.
Eventually I found out that the generated parser may use a lot of stack, sometimes over 6 kB – yes, that can be too much.
The problem is recursive descent parsers in combination with a grammar that has lots of precedence-levels, like e.g. C-style expressions with 15 or more of them.
One precedence level results in one grammar rule which in turn results in one function call. If there are over 15 calls until you finally hit a terminal and then have a rule with parentheses which make the parser start at the top rule again and then there may be another parenthesis which… well, you get the idea.
This is where the 6 kB come from: The stack frame of one method (i.e. grammar rule) may be 64 bytes or more in size (this, parameters, local vars, registers) and the parser-input can easily result in a 100 frames deep callstack.
Another serious drawback is that there’s no built-in way of detecting stack overflows (Req #3). The stack may just overflow and that’s that.