feed

Feb  13  Context-sensitive lexing by Chris Poirier • in Discussions/Conceptspermalink

I’m going to take the opportunity to elaborate on something I mentioned in Introduction to RCC Grammars): RCC-generated lexers are context-sensitive.

Specifically, RCC generates a whole set of lexers for you, one for each parser state.

Why should you care?  Well, because it means that RCC can handle declaratively what you might have had to do with custom code in the past; things like:

  • escape sequences inside strings
  • expressions nested inside strings
  • regular expressions literals
  • etc.

In other words, RCC makes it easy to embed sub-languages in your grammar.  As long as you don’t expect the same string of characters to be lexed in two ways in the same position within your grammar, RCC will ensure that it is lexed in the way that is appropriate for the context.  (And you can get past that other restriction by enabling backtracking within the generated parser.)

Consider this Ruby-like expression:


puts "Confirmed guests: {confirmed_guests.join(", ")}."

An RCC grammar like this could process it:


group expression
   addition_expression    => expression '+' expression  @associativity=left
   subtraction_expression => expression '-' expression  @associativity=left
   
   # . . . additional expression rules left as an exercise for the reader . . . 
   
   string_expression
end

section strings
   strings      
      escape_sequence => '\\' [rtn\\"{]
      general_text    => [\u0000-\uffff]-["\\{]+
   end
   
   string_expression => ‘”‘ (escape_sequence|general_text|nested_expression):element* ‘”‘
   nested_expression => ‘{’ expression ‘}’
end

It is safe to define a string like general_text (which would match just about any text in a source file), because it will only ever be used inside a string_expression (in this grammar, anyway).  And because the opening { of a nested expression is excluded from general_text, any such { inside a string_expression will trigger a shift into nested_expression.

Finally, because RCC generates an LR parser, you can still use } and even string_expressions inside expression without causing a problem, because the parser only considers reductions at the top of the stack.  In other words, a } will not cause a reduce of the nested_expression until the top of the stack matches '{' expression '}' in a state where nested_expression is expected.

Related Links

in Concepts:
in Discussions:
on site:

Discussion: No comments

Jump to comment form | comments rss | trackback uri

Leave a comment

Markdown: The kinds of formatting markup you'd use in an email will probably work here. For more details on what you can do, check out the Markdown docs.

Which is not a colour? red, green, apple, orange (required)


Site copyright 2007-2008 Chris Poirier.       Powered by Wordpress.       Entries RSS Comments RSS Validate Log in