NOTE: This is an excerpt from the in-code documentation of the parser. I thought I’d post it as an introduction on how error recovery works in RCC. I’ll add examples later.
Error recovery in rcc involves trying to alter the token stream in order to produce a valid parse. This is predicated on the idea that the user isn’t trying to produce bad code — that the errors are, in fact, accidents.
We can alter the token stream in three ways: by inserting a token; by replacing a token; or by deleting a token. Generally speaking, we trust deletion the least, as users are lazy: they don’t generally do work they didn’t think they needed. As a result, we try deletions last. Replacing a token is something we trust a little more, but only if the replacement token can be considered lexically similar to the one that was already there. For instance, we might replace “++” with “+” to produce a valid parse, or the identifer “fi” with the keyword “if”; but we should not arbitrarily replace keyword “class” with operator “*”, as it is unlikely the user meant one and typed the other. We rely on machinery in Token for this comparison. Finally, insertion of a token is our favourite choice. It seems far more likely that a user forgot to type something, than typed something extra.
The parser works with a “stack” of position markers (it is maintained as a linked list, but behaves like a stack).
Shift actions create new positions at the head of the stack, and Reduce actions remove one or more positions from
the stack and replace them with a new position. RCC limits its error recovery attempts to positions currently
on the position stack. This generally means that token stream modifications start at the error position, then
jump larger and larger distances back through the source text, looking for ways to fix the token stream. The idea
is that there is no point error correcting stuff that has already matched productions, unless as the result of a
upstream token change. Let’s consider a Ruby example: imagine that we have a dozen lines of valid code processed and reduced to statements, then encounter an unexpected “end” marker. It is unlikely all of that valid code needs reinterpretation. It parsed right once; changing it is only likely to break it. The user may have put the extra “end” in by accident; or they may have forgotten a “begin” before one of those valid statements; or they may have forgotten a “class
Fortunately, at any given position there are a limited number of token stream changes we can make without immediately creating a new error. These are the supported lookahead types from the position’s State’s action set. We error correct by inserting/replacing to one of these, or by deleting the next token and seeing what happens. And, for obvious reasons, we never do anything if the lookahead token is one the error correction system itself created.
more to come
| in Concepts: | (none) » |
| in Discussions: | (none) » |
| on site: |
Markdown: The kinds of formatting markup you'd use in an email will probably work here. For more details on what you can do, check out the Markdown docs.