Notes from the Dawn of Time

Notes from the Dawn of Time #16:

Backtracking that Works

by Richard Bartle
April 10, 2002

Last time, I showed you how one-symbol-lookahead parser would fail to handle a sentence like WITH THE DIAMOND RING THE BELL. These are the kind of parsers generally created by compiler-compilers such as Unix’s yacc (although some of these can be quite sophisticated nowadays and allow more flexibility).

Before I continue, I ought to mention that it’s perfectly possible to get away with this kind of parser in a MUD, with only minor alterations. Backtracking that’s all done within a single rule can still be done. For example, consider the rule fragment:

{ adjective } noun

In the case where the noun can also be an adjective, the implementation of parse(r_noun_group) could easily gobble up a noun believing it to be an adjective, then fail when looking for a following noun. Within that rule itself, backtracking a word and trying to see if that is the noun is a valid option (although in practice it would probably check for the noun case before the adjective case).

I know this is feasible because it’s how I do it in MUD2. MUD2, however, can’t handle backtracking across rules. In the rule fragment

preposition <noun phrase> verb

it would not be able to backtrack into the <noun phrase> if it failed at the verb stage.

I’m about to present the algorithm for something that can.

A Fully Backtracking Parser

The principle for this is the same as for the parse() function I gave in my previous column, but with one major difference: the function’s parameter is a list that carries the backtracking points. Whereas parse(rule) succeeds if the parser is looking at something that matches the rule, parse(rule_list) succeeds if the parser is looking at something that matches the rule and it can get from there to the end of the line.

I’ll strip out the token advance() stuff from this to make it even clearer what’s going on (but don’t worry, I’ll go into detail about token management in my next article). I’ll use a right-associative operator "." to mean "put this at the head of this list", so 0.1.[2, 3, 4] would give [0, 1, 2, 3, 4].

OK, deep breath, here goes:

function parse(rule_list)
begin
   if null(rule_list) return true
   switch head_of(rule_list) into
   begin
   case r_input:
      ...
   case r_sentence:
      ...
   case r_command:
      if parse(r_command1 . tail_of(rule_list)) then
         return true
      else if parse(r_command2 . tail_of(rule_list)) then
         return true
      else
         return false
   case r_command1:
      ...
   case r_command2:
      if current(adverb) then
         if parse(rule_list) then
return true
      if parse(r_cmd2a . tail_of(rule_list)) then
         return true
      else
         return false
   case r_cmd2a:
   if current(preposition) then
      if parse(r_noun_phrase . r_cmd2b . tail_of(rule_list)) then
         return true
   return false
case r_cmd2b:
      if current(adverb) then
         if parse(rule_list) then
return true
      if parse(r_cmd2c . tail_of(rule_list)) then
         return true
      else
         return false
   case r_cmd2c:
   if current(verb) then
      if parse(r_cmd2d . tail_of(rule_list)) then
         return true
   return false
   case r_cmd2d:
      if current(adverb) then
         if parse(rule_list) then
return true
      if parse(r_cmd2e . tail_of(rule_list)) then
         return true
      else
         return false
case r_cmd2e:
if parse(r_noun_phrase . r_cmd2f . tail_of(rule_list)) then
         return true
   return false
case r_cmd2f:
      if current(adverb) then
         if parse(rule_list) then
return true
      if parse(tailof(rule_list)) then
         return true
      else
         return false
   case r_noun_phrase:
      ...
   case r_noun_group:
      ...
   end
end

So what’s going on here?

What’s going on is that rules are split up into their individual components, and each step is handled separately. When there’s a need to invoke a new rule, the continuation (i.e. what to do when the rule has finished) is also passed. This way, if the continuation fails then the backtracking will go through all stages of all intermediate rules.

This is best shown by the trace (adverb clutter removed again):

                                                            current:
                                                            WITH
+ parse([r_input])
| + parse([r_sentence, r_inpa])
| | + parse([r_command, r_snta, r_inpa])
| | | + parse([r_command1, r_snta, r_inpa])
| | | | + current(verb)
| | | | - current(verb) =false
| | | - parse([r_command1, r_snta, r_inpa]) =false
| | | + parse([r_command2, r_snta, r_inpa])
| | | | + parse([r_cmd2a, r_snta, r_inpa])
| | | | | + current(preposition)
| | | | | - current(preposition) =true                      THE
| | | | | + parse([r_noun_phrase, r_cmd2b, r_snta, r_inpa])
| | | | | | + parse([r_noun_group, r_npa, r_cmd2b, r_snta, r_inpa])
| | | | | | | + current(article)
| | | | | | | - current(article) =true                      DIAMOND
| | | | | | | + parse([r_nga, r_npa, r_cmd2b, ...])
| | | | | | | | + current(integer)
| | | | | | | | - current(integer) =false
| | | | | | | | + parse([r_ngb, r_npa, r_cmd2b, ...])
| | | | | | | | | + current(adjective)
| | | | | | | | | - current(adjective) =true                RING
| | | | | | | | | + parse([r_ngb, r_npa, r_cmd2b, ...])
| | | | | | | | | | + current(adjective)
| | | | | | | | | | - current(adjective) =false
| | | | | | | | | - parse([r_ngb, r_npa, r_cmd2b, ...])
| | | | | | | | | + parse([r_ngc, r_npa, r_cmd2b, ...])
| | | | | | | | | | + current(noun)
| | | | | | | | | | - current(noun) =true                   THE
| | | | | | | | | | + parse([r_npa, r_cmd2b, ...])
| | | | | | | | | | | + current(preposition)
| | | | | | | | | | | - current(preposition) =false
| | | | | | | | | | | + parse([r_cmd2b, r_snta, r_inpa])
| | | | | | | | | | | | + parse([r_cmd2c, r_snta, r_inpa])
| | | | | | | | | | | | | + current(verb)
| | | | | | | | | | | | | - current(verb) =false
| | | | | | | | | | | | - parse([r_cmd2c, r_snta, r_inpa]) =false
| | | | | | | | | | | - parse([r_cmd2b, r_snta, r_inpa]) =false
| | | | | | | | | | - parse([r_npa, r_cmd2b, ...])          RING
| | | | | | | | | - parse([r_ngc, r_npa, r_cmd2b, ...])     DIAMOND
| | | | | | | | - parse([r_ngb, r_npa, r_cmd2b, r_snta, r_inpa]) 
| | | | | | | | + parse([r_ngc, r_npa, r_cmd2b, ...])
| | | | | | | | | + current(noun)
| | | | | | | | | - current(noun) =true                     RING
| | | | | | | | | + parse([r_npa, r_cmd2b, ...])
| | | | | | | | | | + current(preposition)
| | | | | | | | | | - current(preposition) =false
| | | | | | | | | | + parse([r_cmd2b, r_snta, r_inpa])
| | | | | | | | | | | + parse([r_cmd2c, r_snta, r_inpa])
| | | | | | | | | | | | + current(verb)
| | | | | | | | | | | | - current(verb) =true               THE
| | | | | | | | | | | | + parse([r_cmd2d, r_snta, r_inpa])
| | | | | | | | | | | | | + current(adverb)
| | | | | | | | | | | | | - current(adverb) =false
| | | | | | | | | | | | | + parse([r_cmd2e, r_snta, r_inpa])
| | | | | | | | | | | | | | + parse([r_noun_phrase, r_cmd2f, ...])
...                                                         BELL
...                                                         EOL
| | | | | | | | | | | | | | | + parse([r_cmd2f, r_snta, r_inpa])
| | | | | | | | | | | | | | | | + current(adverb)
| | | | | | | | | | | | | | | | - current(adverb) =false
| | | | | | | | | | | | | | | | + parse([r_snta, r_inpa])
| | | | | | | | | | | | | | | | | + current(eos)
| | | | | | | | | | | | | | | | | - current(end_of_snt) =false
| | | | | | | | | | | | | | | | | + parse([r_inpa])
| | | | | | | | | | | | | | | | | | + current(eos)
| | | | | | | | | | | | | | | | | | - current(eos) =false
| | | | | | | | | | | | | | | | | | + parse([r_inpb])
| | | | | | | | | | | | | | | | | | | + current(eol)
| | | | | | | | | | | | | | | | | | | - current(eol) =true
| | | | | | | | | | | | | | | | | | - parse([r_inpb]) =true
| | | | | | | | | | | | | | | | | - parse([r_inpa]) =true
| | | | | | | | | | | | | | | | - parse([r_snta, r_inpa]) =true
| | | | | | | | | | | | | | | - parse([r_cmd2f, ...]) =true
| | | | | | | | | | | | | | - parse([r_noun_phrase, ...]) =true
| | | | | | | | | | | | | - parse([r_cmd2e, ...]) =true
| | | | | | | | | | | | - parse([r_cmd2d, ...]) =true
| | | | | | | | | | | - parse([r_cmd2c, ...]) =true
| | | | | | | | | | - parse([r_cmd2b, ...]) =true
| | | | | | | | | - parse([r_npa, r_cmd2b, ...]) =true
| | | | | | | | - parse([r_ngc, r_npa, r_cmd2b, ...]) =true
| | | | | | | - parse([r_nga, r_npa, r_cmd2b, ...]) =true
| | | | | | - parse([r_noun_group, r_npa, r_cmd2b, ...]) =true
| | | | | - parse([r_noun_phrase, r_cmd2b, ...]) =true
| | | | - parse([r_cmd2a, r_snta, r_inpa]) =true
| | | - parse([r_command2, r_snta, r_inpa]) =true
| | - parse([r_command, r_snta, r_inpa]) =true
| - parse([r_sentence, r_inpa]) =true
- parse([r_input]) =true

You will be relieved to know that this is about as complicated as these articles on parsing are going to get!

(Not as relieved as I am — I refuse to believe I haven’t made any mistakes manually generating this lot!).

The key point is during parse([r_ngb, r_npa, r_cmd2b, r_snta, r_inpa]). The parser is looking at the third component of a noun group, which is a list of adjectives. It tries DIAMOND/adjective and then proceeds to finish the noun group off and go on to r_npa. After a quick check for a (non-present) preposition, r_npa is happy and passes control to r_cmd2b. Now r_cmd2b fails, but because it was invoked by r_npa it backtracks there (by the unwinding of the recursive call). Similarly, r_npa backtracks to r_ngb, where there’s another choice point. The pseudocode for r_ngb, which I didn’t show earlier but will do now, is:

case r_ngb:
      if current(adjective) then
         if parse(rule_list) then
return true
      if parse(r_ngc . tail_of(rule_list)) then
         return true
      else
         return false

The first time, r_ngb assigns DIAMOND as an adjective then calls r_ngb again to see if the next word is an adjective. It isn’t, so it then calls r_ngc looking for a noun. Ultimately, the subsequent parse fails. The recursion unwinds back to the point where it decided what to do if DIAMOND was an adjective, i.e. the if parse(rule_list) clause. That returns false, so control passes to parse(r_ngc . tail_of(rule_list)), which ultimately succeeds.

You can now see why I did this as a switch statement rather than as function calls — it’s easier to pass around a list of integer constants than it is a list of functors. Modern languages (indeed, some old ones like LISP) have no problems doing this, of course, but I might have problems explaining it..!

You can program this up using an explicit grammar quite well, and I wouldn’t even get cross with you if you did. It would cut out a lot of the near-identical lines of code and wouldn’t run appreciably slower. The difference is that you can’t hack at it very easily if you want to handle special cases, and the odds are you will want some special cases...

OK, this has all been very exhausting so I’ll stop here. Next article, I’ll give the low-down on the token stream management that I’ve been studiously avoiding up until now. After that, we can think about how to use what this parsing process produces.

Recent Discussions on Notes from the Dawn of Time: