Backtracking that Worksby Richard Bartle Last time, I showed you how one-symbol-lookahead parser would fail to handle a sentence like WITH THE DIAMOND RING THE BELL. These are the kind of parsers generally created by compiler-compilers such as Unixs yacc (although some of these can be quite sophisticated nowadays and allow more flexibility). Before I continue, I ought to mention that its perfectly possible to get away with this kind of parser in a MUD, with only minor alterations. Backtracking thats all done within a single rule can still be done. For example, consider the rule fragment:
In the case where the noun can also be an adjective, the implementation of parse(r_noun_group) could easily gobble up a noun believing it to be an adjective, then fail when looking for a following noun. Within that rule itself, backtracking a word and trying to see if that is the noun is a valid option (although in practice it would probably check for the noun case before the adjective case). I know this is feasible because its how I do it in MUD2. MUD2, however, cant handle backtracking across rules. In the rule fragment
it would not be able to backtrack into the <noun phrase> if it failed at the verb stage. Im about to present the algorithm for something that can. A Fully Backtracking Parser The principle for this is the same as for the parse() function I gave in my previous column, but with one major difference: the functions parameter is a list that carries the backtracking points. Whereas parse(rule) succeeds if the parser is looking at something that matches the rule, parse(rule_list) succeeds if the parser is looking at something that matches the rule and it can get from there to the end of the line. Ill strip out the token advance() stuff from this to make it even clearer whats going on (but dont worry, Ill go into detail about token management in my next article). Ill use a right-associative operator "." to mean "put this at the head of this list", so 0.1.[2, 3, 4] would give [0, 1, 2, 3, 4]. OK, deep breath, here goes: function parse(rule_list) begin if null(rule_list) return true switch head_of(rule_list) into begin case r_input: ... case r_sentence: ... case r_command: if parse(r_command1 . tail_of(rule_list)) then return true else if parse(r_command2 . tail_of(rule_list)) then return true else return false case r_command1: ... case r_command2: if current(adverb) then if parse(rule_list) then return true if parse(r_cmd2a . tail_of(rule_list)) then return true else return false case r_cmd2a: if current(preposition) then if parse(r_noun_phrase . r_cmd2b . tail_of(rule_list)) then return true return false case r_cmd2b: if current(adverb) then if parse(rule_list) then return true if parse(r_cmd2c . tail_of(rule_list)) then return true else return false case r_cmd2c: if current(verb) then if parse(r_cmd2d . tail_of(rule_list)) then return true return false case r_cmd2d: if current(adverb) then if parse(rule_list) then return true if parse(r_cmd2e . tail_of(rule_list)) then return true else return false case r_cmd2e: if parse(r_noun_phrase . r_cmd2f . tail_of(rule_list)) then return true return false case r_cmd2f: if current(adverb) then if parse(rule_list) then return true if parse(tailof(rule_list)) then return true else return false case r_noun_phrase: ... case r_noun_group: ... end end So whats going on here? Whats going on is that rules are split up into their individual components, and each step is handled separately. When theres a need to invoke a new rule, the continuation (i.e. what to do when the rule has finished) is also passed. This way, if the continuation fails then the backtracking will go through all stages of all intermediate rules. This is best shown by the trace (adverb clutter removed again): current: WITH + parse([r_input]) | + parse([r_sentence, r_inpa]) | | + parse([r_command, r_snta, r_inpa]) | | | + parse([r_command1, r_snta, r_inpa]) | | | | + current(verb) | | | | - current(verb) =false | | | - parse([r_command1, r_snta, r_inpa]) =false | | | + parse([r_command2, r_snta, r_inpa]) | | | | + parse([r_cmd2a, r_snta, r_inpa]) | | | | | + current(preposition) | | | | | - current(preposition) =true THE | | | | | + parse([r_noun_phrase, r_cmd2b, r_snta, r_inpa]) | | | | | | + parse([r_noun_group, r_npa, r_cmd2b, r_snta, r_inpa]) | | | | | | | + current(article) | | | | | | | - current(article) =true DIAMOND | | | | | | | + parse([r_nga, r_npa, r_cmd2b, ...]) | | | | | | | | + current(integer) | | | | | | | | - current(integer) =false | | | | | | | | + parse([r_ngb, r_npa, r_cmd2b, ...]) | | | | | | | | | + current(adjective) | | | | | | | | | - current(adjective) =true RING | | | | | | | | | + parse([r_ngb, r_npa, r_cmd2b, ...]) | | | | | | | | | | + current(adjective) | | | | | | | | | | - current(adjective) =false | | | | | | | | | - parse([r_ngb, r_npa, r_cmd2b, ...]) | | | | | | | | | + parse([r_ngc, r_npa, r_cmd2b, ...]) | | | | | | | | | | + current(noun) | | | | | | | | | | - current(noun) =true THE | | | | | | | | | | + parse([r_npa, r_cmd2b, ...]) | | | | | | | | | | | + current(preposition) | | | | | | | | | | | - current(preposition) =false | | | | | | | | | | | + parse([r_cmd2b, r_snta, r_inpa]) | | | | | | | | | | | | + parse([r_cmd2c, r_snta, r_inpa]) | | | | | | | | | | | | | + current(verb) | | | | | | | | | | | | | - current(verb) =false | | | | | | | | | | | | - parse([r_cmd2c, r_snta, r_inpa]) =false | | | | | | | | | | | - parse([r_cmd2b, r_snta, r_inpa]) =false | | | | | | | | | | - parse([r_npa, r_cmd2b, ...]) RING | | | | | | | | | - parse([r_ngc, r_npa, r_cmd2b, ...]) DIAMOND | | | | | | | | - parse([r_ngb, r_npa, r_cmd2b, r_snta, r_inpa]) | | | | | | | | + parse([r_ngc, r_npa, r_cmd2b, ...]) | | | | | | | | | + current(noun) | | | | | | | | | - current(noun) =true RING | | | | | | | | | + parse([r_npa, r_cmd2b, ...]) | | | | | | | | | | + current(preposition) | | | | | | | | | | - current(preposition) =false | | | | | | | | | | + parse([r_cmd2b, r_snta, r_inpa]) | | | | | | | | | | | + parse([r_cmd2c, r_snta, r_inpa]) | | | | | | | | | | | | + current(verb) | | | | | | | | | | | | - current(verb) =true THE | | | | | | | | | | | | + parse([r_cmd2d, r_snta, r_inpa]) | | | | | | | | | | | | | + current(adverb) | | | | | | | | | | | | | - current(adverb) =false | | | | | | | | | | | | | + parse([r_cmd2e, r_snta, r_inpa]) | | | | | | | | | | | | | | + parse([r_noun_phrase, r_cmd2f, ...]) ... BELL ... EOL | | | | | | | | | | | | | | | + parse([r_cmd2f, r_snta, r_inpa]) | | | | | | | | | | | | | | | | + current(adverb) | | | | | | | | | | | | | | | | - current(adverb) =false | | | | | | | | | | | | | | | | + parse([r_snta, r_inpa]) | | | | | | | | | | | | | | | | | + current(eos) | | | | | | | | | | | | | | | | | - current(end_of_snt) =false | | | | | | | | | | | | | | | | | + parse([r_inpa]) | | | | | | | | | | | | | | | | | | + current(eos) | | | | | | | | | | | | | | | | | | - current(eos) =false | | | | | | | | | | | | | | | | | | + parse([r_inpb]) | | | | | | | | | | | | | | | | | | | + current(eol) | | | | | | | | | | | | | | | | | | | - current(eol) =true | | | | | | | | | | | | | | | | | | - parse([r_inpb]) =true | | | | | | | | | | | | | | | | | - parse([r_inpa]) =true | | | | | | | | | | | | | | | | - parse([r_snta, r_inpa]) =true | | | | | | | | | | | | | | | - parse([r_cmd2f, ...]) =true | | | | | | | | | | | | | | - parse([r_noun_phrase, ...]) =true | | | | | | | | | | | | | - parse([r_cmd2e, ...]) =true | | | | | | | | | | | | - parse([r_cmd2d, ...]) =true | | | | | | | | | | | - parse([r_cmd2c, ...]) =true | | | | | | | | | | - parse([r_cmd2b, ...]) =true | | | | | | | | | - parse([r_npa, r_cmd2b, ...]) =true | | | | | | | | - parse([r_ngc, r_npa, r_cmd2b, ...]) =true | | | | | | | - parse([r_nga, r_npa, r_cmd2b, ...]) =true | | | | | | - parse([r_noun_group, r_npa, r_cmd2b, ...]) =true | | | | | - parse([r_noun_phrase, r_cmd2b, ...]) =true | | | | - parse([r_cmd2a, r_snta, r_inpa]) =true | | | - parse([r_command2, r_snta, r_inpa]) =true | | - parse([r_command, r_snta, r_inpa]) =true | - parse([r_sentence, r_inpa]) =true - parse([r_input]) =true You will be relieved to know that this is about as complicated as these articles on parsing are going to get! (Not as relieved as I am I refuse to believe I havent made any mistakes manually generating this lot!). The key point is during parse([r_ngb, r_npa, r_cmd2b, r_snta, r_inpa]). The parser is looking at the third component of a noun group, which is a list of adjectives. It tries DIAMOND/adjective and then proceeds to finish the noun group off and go on to r_npa. After a quick check for a (non-present) preposition, r_npa is happy and passes control to r_cmd2b. Now r_cmd2b fails, but because it was invoked by r_npa it backtracks there (by the unwinding of the recursive call). Similarly, r_npa backtracks to r_ngb, where theres another choice point. The pseudocode for r_ngb, which I didnt show earlier but will do now, is: case r_ngb: if current(adjective) then if parse(rule_list) then return true if parse(r_ngc . tail_of(rule_list)) then return true else return false The first time, r_ngb assigns DIAMOND as an adjective then calls r_ngb again to see if the next word is an adjective. It isnt, so it then calls r_ngc looking for a noun. Ultimately, the subsequent parse fails. The recursion unwinds back to the point where it decided what to do if DIAMOND was an adjective, i.e. the if parse(rule_list) clause. That returns false, so control passes to parse(r_ngc . tail_of(rule_list)), which ultimately succeeds. You can now see why I did this as a switch statement rather than as function calls its easier to pass around a list of integer constants than it is a list of functors. Modern languages (indeed, some old ones like LISP) have no problems doing this, of course, but I might have problems explaining it..! You can program this up using an explicit grammar quite well, and I wouldnt even get cross with you if you did. It would cut out a lot of the near-identical lines of code and wouldnt run appreciably slower. The difference is that you cant hack at it very easily if you want to handle special cases, and the odds are you will want some special cases... OK, this has all been very exhausting so Ill stop here. Next article, Ill give the low-down on the token stream management that Ive been studiously avoiding up until now. After that, we can think about how to use what this parsing process produces. |