Algorithms for Parsing
by Richard Bartle
In my previous article, I derived a grammar for a subset of English that would be suitable for use in a MUD. This time, I'm going to discuss how a parser for it might be implemented. I'm afraid you won't be seeing any large pieces of C or C++ or whatever your language of choice happens to be: my aim is to give you enough information that you can have a crack at writing a parser yourself; it isn't to info-dump a working parser on you and then expand on the comments in the code while deluding myself that anyone is following what's going on. There'll be pseudocode, but nothing bordering on the executable.
Top-Down versus Bottom-Up
There are essentially two approaches to parsing: top-down and bottom-up. A top-down approach uses the philosophy of "know what you want and see what you've got", whereas a bottom-up approach prefers "know what you've got and see what you want".
I'll illustrate this with a simple example. Apply the following rules to the input y x z:

<a> ::= <b> | <b> x <b>
<b> ::= y | z
A top-down approach says: "I want an <a>, which starts with a <b>. A <b> starts with a y or a z. Is the first character of the sentence a y or a z? If so, I have a <b>, so I can look at the next character. If there isn't one, I've found an <a>. Otherwise, the next character had better be an x and be followed by a <b>."
A bottom-up approach says: "I have a y. Are there any rules that start with a y? Yes, the one for <b>. OK, that's done the <b>, what's next? An x. There aren't any rules that start with y then x. Are there any rules that start with <b> then x? Yes, the first one, but it wants another <b> after the x. I'll bear that in mind while I look at the next character. Aha, it's a z, which is a <b>, therefore overall I have an <a>."
In general, bottom-up is faster but gives poorer error messages; people would rather be told by the parser, "This is what I want but I don't know what I've got" than, "This is what I've got but I don't know what I want". Compilers of programming languages tend to mix the two approaches, preferring bottom-up for logical and arithmetic expressions but top-down for everything else.
For a MUD command parser, you don't need to worry about mixing top-down and bottom-up, as the grammar is small and the parser doesn't really get to do a lot of work most of the time. I recommend using a top-down parser because it's easier to give a decent error message when people type in something non-grammatical; it's also easier to tinker with when you want to write a "special case" to handle something that the players keep on trying but that doesn't fit your grammar. If you prefer to write a bottom-up parser, that's fair enough; I'll be describing the top-down approach here, though.
There are, incidentally, tools which will automatically write you a lexical analyser and a recogniser (ie. a parser that will tell you whether an input line is grammatical or not but wont build a data structure). The best-known ones are lex and yacc, which come with the Unix C development system. Feel free to use these if you want, although it might be a good idea if you knew the general principles involved first so you can make an informed choice.
Implicit versus Explicit

The next decision you have to make is whether to use an implicit grammar or an explicit one. In the implicit approach, you code each BNF rule as a function in a program (your parser). A recogniser for the tiny grammar I gave earlier would look something like the following (in pseudocode); it assumes a function current(token), which returns true if the token the parser is currently looking at matches its parameter:
function parse_a()
begin
    if parse_b() then
        if current(token_x) then
            if parse_b() then
                return true
            else
                return error("<b> expected after x.")
        else
            return true
    else
        return error("<b> expected to begin <a>.")
end

function parse_b()
begin
    if current(token_y) or current(token_z) then
        return true
    else
        return false
end
You can see how these things can easily be computer-generated...
For clarity, I've omitted the advance-to-next-token management, but that could either be encoded as part of current(token) or placed inline directly after a successful call to current(token). Similarly, code to retreat-to-previous-token has also been left out.
The explicit approach builds a data structure to represent the grammar and uses a generic function to traverse it. I won't go into the details here of how to build the data structure, but hopefully you can get a flavour of what it might look like from the pseudocode for a traversal function:
function parse(node)
begin
    if null(node) then
        return false
    else if terminal(type_of(node)) then
        if current(token_of(node)) then
            return true
        else
            return false
    else if disjunct(type_of(node)) then    // for | symbols
        if parse(head_of(node)) then
            return true
        else
            return parse(tail_of(node))
    else if parse(head_of(node)) then
        return parse(tail_of(node))
    else
        return false
end
Parsers using implicit grammars normally run faster than those using explicit ones, but for the size of parser we're talking about here that's no big deal. Explicit grammars are easier to alter and require less programming; furthermore, the data structure for the grammar has a nice symmetry with the data structure that the parser will build (a parse tree). However, on the whole the implicit approach is probably the better one, because the error messages it allows can be more pinpointed and informative, plus it's easier to force it to accept something non-grammatical (warning: whatever grammar you use, it's never perfect!).
So far, I've discussed only about a third of the algorithm you need to write the parsing module. You know what a grammar looks like and you have a few clues as to how to program a recogniser for a grammar. Still to come is how to interface the tokeniser to the parser and how to obtain the information from the parse tree that you need to pass on to the binder. I'll start on those next time.