Notes from the Dawn of Time

Notes from the Dawn of Time #15:

Backtracking

by Richard Bartle
March 13, 2002

Time flies like an arrow, but fruit flies like a banana.

This is an example of a garden path sentence, so called because the reader is led up the garden path (ie. sent in a wrong direction). Here are some more:

The old man the boats.

The horse raced across the heath fell.

When I ran a mile seemed a long distance.

The granite rocks in an earthquake.

Modern computer languages are designed so as to make parsing them easy. They don't have any ambiguity, therefore when part of the input stream has been parsed there's never any reason to go back and reparse it. Natural language, however, is ambiguous. Sometimes a parser has several possible ways to proceed and it chooses the wrong one. When it discovers its mistake, it has to backtrack and try another possibility.

A (contrived) MUD example: WITH THE DIAMOND RING THE BELL. Here, DIAMOND can be both an adjective and a noun, and RING can be both a noun and a verb. The parser successfully assigns WITH/preposition THE/article DIAMOND/adjective RING/noun then comes across THE while expecting a verb or an adverb. It can't be solved by having the parser try its rules in a different order, because then WITH THE DIAMOND RING RING THE BELL would fail. There's no way out of it: the parser has to backtrack.

The parsing strategy I recommended in my previous article was the top-down, implicit approach. This is implemented most naturally as a left-to-right recursive descent, ie. you take each possible | ("or") choice in order left-to-right and try the next one if the current one fails. Don't let the "recursive" bit put you off – it’s mainly iterative because the grammar you'll use will almost certainly be reducible in most places to tail recursion (which is iteration).

The tricky bit about this is actually keeping track of where to when you have to backtrack. There are two things you need to know: which token you should consider to be the one currently under consideration; which choice point you are at.

The token part is easy. You can think of the result of tokenisation being an array of tokens (even if it's actually a stream of them), in the order they appeared in the input line. Advancing by a token is like incrementing a counter, and retreating is like decrementing it. You can store and recover your current position easily, too.

The backtrack point is hard. I'd better give an example to show you what I'm babbling on about...

An Example

Let's take the sentence I constructed earlier and number the tokens:

Word	WITH	THE	DIAMOND	RING	THE	BELL
Number	1	2	3	4	5	6
PoS	prep	article	adjective noun	noun verb	article	noun

I've marked this with the possible parts of speech (PoS) that the words could take on. Like most nouns, RING could probably also be an adjective (THE RING SHOP) as could BELL (which could additionally be an unlikely verb). However, for the purposes of this example we'll stick with the above simple alternatives.

To access the token array we'll use a function that returns either false (its parameter does not match the token we're looking at) or true (it does). I've been calling this function current(token), so I'll stick with that. I ought also to use some other functions to manage the index into the token array, but I'm only going to put in the one that advances to the next token; the rest are, for the moment, omitted for brevity.

First, I'll show you what happens when things are done the same way as in a computer language parser, ie. not how we want to do it. Suppose that for each rule of the grammar (of which there were only 5 in the one I gave a couple of articles ago) we have a separate case of a switch statement in a function called parse(). You could code the rules as individual functions if you liked, as I demonstrated in my previous article, but I'm choosing to do it this way for a reason that will become apparent later. For simple ease of explanation, I'll split the <command> rule in two, one for each side of the | symbol. A skeletal pseudocode for the function that has the <command2> case expanded in full might look as follows:

function parse(rule_type r)
begin
   switch r into
   begin
   case r_input:
      ...
   case r_sentence:
      ...
   case r_command:
      if parse(r_command1) then
         return true
      else if parse(r_command2) then
         return true
      else
         return false
      case r_command1:
         ...
      case r_command2:
         while current(adverb) do advance()
         if current(preposition) then
         begin
            advance()
            if parse(r_noun_phrase) then
            begin
               while current(adverb) do advance()
               if current(verb) then
               begin
                  advance()
                  while current(adverb) do advance()
                  if parse(r_noun_phrase) then
                  begin
                     while current(adverb) do advance()
                     return true
                  end else
                     return false
               end else
                  return false
            end else
               return false
            end else
            return false
   case r_noun_phrase:
      ...
   case r_noun_group:
      ...
   end
end

Yes, sorry that's a little drawn out; I do realise there dotare much more compact ways to code it but I'm trying to be clear here, not clever.This is a one-symbol lookahead approach that works fine when there is no ambiguity. From where you are and how you got there, you always know where to go. Our WITH THE DIAMOND RING THE BELL example completely flummoxes it, though. Here is a trace of how the parse fails (ignoring the tedious adverb checks that in this example never find an adverb anyway). I've marked changes to the current token under consideration when they occur (which is following a successful call to current()):

                                            current:
    
                                            WITH

+ parse(r_input)

| + parse(r_sentence)

| | + parse(r_command)

| | | + parse(r_command1)

| | | | + current(verb)

| | | | - current(verb) =false

| | | - parse(r_command1) =false

| | | + parse(r_command2)

| | | | + current(preposition)

| | | | - current(preposition) =true        THE

| | | | + parse(noun_phrase)

| | | | | + parse(r_noun_group)

| | | | | | + current(article)

| | | | | | - current(article) =true        DIAMOND

| | | | | | + current(integer)

| | | | | | - current(integer) =false

| | | | | | + current(adjective)

| | | | | | - current(adjective) =true      RING

| | | | | | + current(adjective)

| | | | | | - current(adjective) =false

| | | | | | + current(noun)

| | | | | | - current(noun) =true           THE

| | | | | - parse(r_noun_group) =true

| | | | | + current(preposition)

| | | | | - current(preposition) =false

| | | | - parse(r_noun_phrase) =true

| | | | + current(verb)

| | | | - current(verb) =false

Crunch!

At this point, the parser has found a preposition followed by a <noun phrase> and is now looking for a verb. It doesn't find one. What it ought to do is back up through all its previous decisions and try each one again in turn. What it actually does is:

| | | - parse(r_command2) =false    WITH

| | - parse(r_command) =false

| - parse(r_sentence) =false

- parse_input =false

In other words, it fails to parse an input line. This isn't because it doesn't know where to look in the token stream; rather, it's because it doesn't have access to the right choice point where it can try again. What you really want is for parse(r_command) to be able to call parse(r_noun_phrase) such that it produces a different result from the one it did the first time. Unfortunately, it won't.

OK, next time I'll show you what parse_noun_group() should look like.

Recent Discussions on Notes from the Dawn of Time: