feed

Mar  22  Introduction to RCC Error Recovery by Chris Poirier • in Discussions/Conceptspermalink

NOTE: This is an excerpt from the in-code documentation of the parser.  I thought I’d post it as an introduction on how error recovery works in RCC.  I’ll add examples later.

Error recovery in rcc involves trying to alter the token stream in order to produce a valid parse.  This is predicated on the idea that the user isn’t trying to produce bad code — that the errors are, in fact, accidents.

We can alter the token stream in three ways: by inserting a token; by replacing a token; or by deleting a token. Generally speaking, we trust deletion the least, as users are lazy: they don’t generally do work they didn’t think they needed.  As a result, we try deletions last.  Replacing a token is something we trust a little more, but only if the replacement token can be considered lexically similar to the one that was already there.  For instance, we might replace “++” with “+” to produce a valid parse, or the identifer “fi” with the keyword “if”; but we should not arbitrarily replace keyword “class” with operator “*”, as it is unlikely the user meant one and typed the other. We rely on machinery in Token for this comparison.  Finally, insertion of a token is our favourite choice.  It seems far more likely that a user forgot to type something, than typed something extra.

The parser works with a “stack” of position markers (it is maintained as a linked list, but behaves like a stack). Shift actions create new positions at the head of the stack, and Reduce actions remove one or more positions from the stack and replace them with a new position.  RCC limits its error recovery attempts to positions currently on the position stack.  This generally means that token stream modifications start at the error position, then jump larger and larger distances back through the source text, looking for ways to fix the token stream.  The idea is that there is no point error correcting stuff that has already matched productions, unless as the result of a upstream token change.  Let’s consider a Ruby example: imagine that we have a dozen lines of valid code processed and reduced to statements, then encounter an unexpected “end” marker.  It is unlikely all of that valid code needs reinterpretation.  It parsed right once; changing it is only likely to break it.  The user may have put the extra “end” in by accident; or they may have forgotten a “begin” before one of those valid statements; or they may have forgotten a “class ” or something similar at the very top of the file.  By error correcting only at positions still on the stack, we conveniently leap to those suspicious points.

Fortunately, at any given position there are a limited number of token stream changes we can make without immediately creating a new error.  These are the supported lookahead types from the position’s State’s action set.  We error correct by inserting/replacing to one of these, or by deleting the next token and seeing what happens. And, for obvious reasons, we never do anything if the lookahead token is one the error correction system itself created.

more to come

Related Links

in Concepts:
in Discussions:
on site:

Discussion: No comments

Jump to comment form | comments rss | trackback uri

Leave a comment

Markdown: The kinds of formatting markup you'd use in an email will probably work here. For more details on what you can do, check out the Markdown docs.

Which is not a fruit? carrot, apple, banana (required)


Site copyright 2007-2008 Chris Poirier.       Powered by Wordpress.       Entries RSS Comments RSS Validate Log in