by Richard Bartle
Time flies like an arrow, but fruit flies like a banana.
This is an example of a garden path sentence, so called because the reader is led up the garden path (ie. sent in a wrong direction). Here are some more:
The old man the boats.
The horse raced across the heath fell.
When I ran a mile seemed a long distance.
The granite rocks in an earthquake.
Modern computer languages are designed so as to make parsing them easy. They don't have any ambiguity, so once part of the input stream has been parsed there's never any reason to go back and reparse it. Natural language, however, is ambiguous. Sometimes a parser has several possible ways to proceed and it chooses the wrong one. When it discovers its mistake, it has to backtrack and try another possibility.
A (contrived) MUD example: WITH THE DIAMOND RING THE BELL. Here, DIAMOND can be both an adjective and a noun, and RING can be both a noun and a verb. The parser successfully assigns WITH/preposition THE/article DIAMOND/adjective RING/noun then comes across THE while expecting a verb or an adverb. It can't be solved by having the parser try its rules in a different order, because then WITH THE DIAMOND RING RING THE BELL would fail. There's no way out of it: the parser has to backtrack.
The parsing strategy I recommended in my previous article was the top-down, implicit approach. This is implemented most naturally as a left-to-right recursive descent, ie. you take each possible | ("or") choice in order left-to-right and try the next one if the current one fails. Don't let the "recursive" bit put you off: it's mainly iterative, because the grammar you'll use will almost certainly be reducible in most places to tail recursion (which is iteration).
The tricky bit about this is actually keeping track of where to go when you have to backtrack. There are two things you need to know: which token you should consider to be the one currently under consideration, and which choice point you are at.
The token part is easy. You can think of the result of tokenisation being an array of tokens (even if it's actually a stream of them), in the order they appeared in the input line. Advancing by a token is like incrementing a counter, and retreating is like decrementing it. You can store and recover your current position easily, too.
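To make the counter idea concrete, here is a minimal Python sketch. It assumes the tokens are already sitting in a list; the names TokenCursor, mark, reset and peek are purely illustrative and don't come from any code in these articles.

```python
class TokenCursor:
    """An index into a token list whose position can be stored and recovered."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0  # index of the token currently under consideration

    def advance(self):
        # Advancing by a token is just incrementing the counter.
        self.pos += 1

    def retreat(self):
        # Retreating is decrementing it.
        self.pos -= 1

    def mark(self):
        # Store the current position so that reset() can return to it later.
        return self.pos

    def reset(self, saved):
        # Recover a previously stored position.
        self.pos = saved

    def peek(self):
        # The current token, or None if the index is outside the input.
        if 0 <= self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return None


cursor = TokenCursor("WITH THE DIAMOND RING THE BELL".split())
saved = cursor.mark()   # remember that we were looking at WITH
cursor.advance()
cursor.advance()        # now looking at DIAMOND
cursor.reset(saved)     # back to WITH in one step
```

Storing a position is only a matter of copying an integer, which is why this half of the backtracking problem is the easy one.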
The backtrack point is hard. I'd better give an example to show you what I'm babbling on about...
Let's take the sentence I constructed earlier and number the tokens:
I've marked this with the possible parts of speech (PoS) that the words could take on. Like most nouns, RING could probably also be an adjective (THE RING SHOP) as could BELL (which could additionally be an unlikely verb). However, for the purposes of this example we'll stick with the above simple alternatives.
To access the token array we'll use a function that returns either false (its parameter does not match the token we're looking at) or true (it does). I've been calling this function current(token), so I'll stick with that. I ought also to use some other functions to manage the index into the token array, but I'm only going to put in the one that advances to the next token; the rest are, for the moment, omitted for brevity.
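For anyone who prefers to see it written down, the two functions might look something like this in Python. This is only a sketch: the TOKENS list is my assumed encoding of the example sentence, with each word carrying the set of parts of speech discussed above, and the global index is a simplification.

```python
# Assumed representation: each token is (word, set of possible parts of speech).
TOKENS = [
    ("WITH", {"preposition"}),
    ("THE", {"article"}),
    ("DIAMOND", {"adjective", "noun"}),
    ("RING", {"noun", "verb"}),
    ("THE", {"article"}),
    ("BELL", {"noun"}),
]

index = 0  # position of the token we're looking at


def current(part_of_speech):
    """True if the token we're looking at can be the given part of speech."""
    if index >= len(TOKENS):
        return False  # nothing left to look at, so nothing matches
    return part_of_speech in TOKENS[index][1]


def advance():
    """Move on to the next token."""
    global index
    index += 1
```

Note that current() can say yes to more than one question about the same token: with the index on DIAMOND, both current("adjective") and current("noun") are true, and that is where the trouble starts.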
First, I'll show you what happens when things are done the same way as in a computer language parser, ie. not how we want to do it. Suppose that for each rule of the grammar (of which there were only 5 in the one I gave a couple of articles ago) we have a separate case of a switch statement in a function called parse(). You could code the rules as individual functions if you liked, as I demonstrated in my previous article, but I'm choosing to do it this way for a reason that will become apparent later. For simple ease of explanation, I'll split the <command> rule in two, one for each side of the | symbol. A skeletal pseudocode for the function that has the <command2> case expanded in full might look as follows:
    function parse(rule_type r)
    begin
        switch r into
        begin
            case r_input: ...
            case r_sentence: ...
            case r_command:
                if parse(r_command1) then return true
                else if parse(r_command2) then return true
                else return false
            case r_command1: ...
            case r_command2:
                while current(adverb) do advance()
                if current(preposition) then
                begin
                    advance()
                    if parse(r_noun_phrase) then
                    begin
                        while current(adverb) do advance()
                        if current(verb) then
                        begin
                            advance()
                            while current(adverb) do advance()
                            if parse(r_noun_phrase) then
                            begin
                                while current(adverb) do advance()
                                return true
                            end
                            else return false
                        end
                        else return false
                    end
                    else return false
                end
                else return false
            case r_noun_phrase: ...
            case r_noun_group: ...
        end
    end
Yes, sorry, that's a little drawn out; I do realise there are much more compact ways to code it, but I'm trying to be clear here, not clever.
This is a one-symbol lookahead approach that works fine when there is no ambiguity. From where you are and how you got there, you always know where to go. Our WITH THE DIAMOND RING THE BELL example completely flummoxes it, though. Here is a trace of how the parse fails (ignoring the tedious adverb checks that in this example never find an adverb anyway). I've marked changes to the current token under consideration when they occur (which is following a successful call to current()):
At this point, the parser has found a preposition followed by a <noun phrase> and is now looking for a verb. It doesn't find one. What it ought to do is back up through all its previous decisions and try each one again in turn. What it actually does is:
| | | - parse(r_command2) =false WITH
In other words, it fails to parse an input line. This isn't because it doesn't know where to look in the token stream; rather, it's because it doesn't have access to the right choice point where it can try again. What you really want is for parse(r_command) to be able to call parse(r_noun_phrase) such that it produces a different result from the one it did the first time. Unfortunately, it won't.
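The failure is easy to reproduce. The Python sketch below covers only the <command2> shape (preposition, noun phrase, verb, noun phrase) with the adverb checks left out. The noun phrase rule in it (optional article, any number of adjectives, then a noun) is my assumption for illustration, as are the LEXICON table and the final check that all the input was used; none of it is meant as the real grammar.

```python
# Assumed lexicon: the parts of speech each word of the example can take on.
LEXICON = {
    "WITH": {"preposition"},
    "THE": {"article"},
    "DIAMOND": {"adjective", "noun"},
    "RING": {"noun", "verb"},
    "BELL": {"noun"},
}


def parse_command2(words):
    """One-symbol lookahead, no backtracking: once a word has been consumed
    in one role, that decision is never revisited."""
    pos = 0

    def current(part):
        return pos < len(words) and part in LEXICON[words[pos]]

    def noun_phrase():
        # Assumed rule: optional article, zero or more adjectives, one noun.
        nonlocal pos
        if current("article"):
            pos += 1
        while current("adjective"):
            pos += 1  # DIAMOND is taken as an adjective here, for good
        if current("noun"):
            pos += 1
            return True
        return False

    if not current("preposition"):
        return False
    pos += 1
    if not noun_phrase():
        return False
    if not current("verb"):
        return False  # no way back to retry the noun phrase differently
    pos += 1
    if not noun_phrase():
        return False
    return pos == len(words)  # assumed: all the input must be used up


fails = parse_command2("WITH THE DIAMOND RING THE BELL".split())
works = parse_command2("WITH THE DIAMOND RING RING THE BELL".split())
```

Here fails is False and works is True. The second call succeeds only because the first choice for DIAMOND happens to be the right one; in the first call, noun_phrase() has already returned by the time the missing verb is noticed, and nothing remains that could ask it for a different answer.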
OK, next time I'll show you what parse_noun_group() should look like.