Algorithms for Parsing

by Richard Bartle

In my previous article, I derived a grammar for a subset of English that would be suitable for use in a MUD. This time, I'm going to discuss how a parser for it might be implemented. I'm afraid you won't be seeing any large pieces of C or C++ or whatever your language of choice happens to be: my aim is to give you enough information that you can have a crack at writing a parser yourself; it isn't to info-dump a working parser on you and then expand on the comments in the code while deluding myself that anyone is following what's going on. There'll be pseudocode, but nothing bordering on the executable.

Top-Down versus Bottom-Up

There are essentially two approaches to parsing: top-down and bottom-up. A top-down approach uses the philosophy of "know what you want and see what you've got", whereas a bottom-up approach prefers "know what you've got and see what you want". I'll illustrate this with a simple example. Apply the following rules to the input y x z:

<a> ::= <b> x <b> | <b>
<b> ::= y | z
A top-down approach says: "I want an <a>, which starts with a <b>. A <b> starts with a y or a z. Is the first character of the sentence a y or a z? If so, I have a <b>, so I can look at the next character. If there isn't one, I've found an <a>. Otherwise, the next character had better be an x and be followed by a <b>."

A bottom-up approach says: "I have a y. Are there any rules that start with a y? Yes, the one for <b>. OK, that's done the <b>, what's next? An x. There aren't any rules that start with y then x. Are there any rules that start with <b> then x? Yes, the first one, but it wants another <b> after the x. I'll bear that in mind while I look at the next character. Aha, it's a z, which is a <b>, therefore overall I have an <a>."

In general, bottom-up is faster but gives poorer error messages; people would rather be told by the parser, "This is what I want but I don't know what I've got" than, "This is what I've got but I don't know what I want". Compilers of programming languages tend to mix the two approaches, preferring bottom-up for logical and arithmetic expressions but top-down for everything else.

For a MUD command parser, you don't need to worry about mixing top-down and bottom-up, as the grammar is small and the parser doesn't really get to do a lot of work most of the time. I recommend using a top-down parser because it's easier to give a decent error message when people type in something non-grammatical; it's also easier to tinker with when you want to write a "special case" to handle something that the players keep on trying but that doesn't fit your grammar. If you prefer to write a bottom-up parser, that's fair enough; I'll be describing the top-down approach here, though.

There are, incidentally, tools which will automatically write you a lexical analyser and a recogniser (i.e. a parser that will tell you whether an input line is grammatical or not, but won't build a data structure).
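By way of illustration, here is one way the bottom-up monologue might look as real code: a Python sketch of a shift-reduce recogniser, with the stack-based mechanics and all the names my own invention, and the two rules reconstructed from the traces above.

```python
# A minimal shift-reduce (bottom-up) recogniser for the toy grammar:
#   <a> ::= <b> x <b> | <b>
#   <b> ::= y | z
# Tokens arrive as a list of single-character strings.
def bottom_up_recognise(tokens):
    stack = []
    for token in tokens:
        stack.append(token)                    # shift the next token
        if stack[-1] in ('y', 'z'):            # reduce: y | z  ->  <b>
            stack[-1] = '<b>'
        if stack[-3:] == ['<b>', 'x', '<b>']:  # reduce: <b> x <b>  ->  <a>
            stack[-3:] = ['<a>']
    if stack == ['<b>']:                       # at end of input, a lone <b> is an <a>
        stack = ['<a>']
    return stack == ['<a>']
```

Run on y x z, it reduces the y to a <b>, shifts the x, reduces the z to a <b>, then collapses the lot to an <a>, just as in the monologue.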
The best-known ones are lex and yacc, which come with the Unix C development system. Feel free to use these if you want, although it might be a good idea if you knew the general principles involved first so you can make an informed choice.

Implicit versus Explicit

The next decision you have to make is whether to use an implicit grammar or an explicit one. In the implicit approach, you code each BNF rule as a function in a program (your parser). A recogniser for the tiny grammar I gave earlier would look something like the following (in pseudocode); it assumes a function current(token), which returns true if the token the parser is currently looking at matches its parameter:

function parse_a()
begin
  if parse_b() then
    if current(token_x) then
      if parse_b() then
        return true
      else
        return error("<b> expected after x.")
    else
      return true
  else
    return error("<b> expected to begin <a>.")
end

function parse_b()
begin
  if current(token_y) or current(token_z) then
    return true
  else
    return false
end

You can see how these things can easily be computer-generated... For clarity, I've omitted the advance-to-next-token management, but that could be either encoded as part of current(token) or placed inline directly after a successful call to current(token). Similarly, code to retreat-to-previous-token has also been left out.

The explicit approach builds a data structure to represent the grammar and uses a generic function to traverse it.
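Filling in the token management that the pseudocode leaves out, a runnable version of the implicit approach might look like this in Python. It's a sketch under my own assumptions: a cursor index stands in for the advance-to-next-token code, and the error() calls are replaced by plain false returns with the messages left as comments.

```python
# Recursive-descent (top-down, implicit-grammar) recogniser for:
#   <a> ::= <b> x <b> | <b>
#   <b> ::= y | z
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0  # cursor: the advance-to-next-token management

    def current(self, token):
        # True if the token under the cursor matches; advances on success.
        if self.pos < len(self.tokens) and self.tokens[self.pos] == token:
            self.pos += 1
            return True
        return False

    def parse_a(self):
        if not self.parse_b():
            return False           # "<b> expected to begin <a>."
        if self.current('x'):
            return self.parse_b()  # False here means "<b> expected after x."
        return True

    def parse_b(self):
        return self.current('y') or self.current('z')

def recognise(tokens):
    p = Parser(tokens)
    # grammatical only if <a> parses and the whole input was consumed
    return p.parse_a() and p.pos == len(tokens)
```

The final check that the whole input was consumed matters: without it, y z would be accepted because a lone y is a perfectly good <a> on its own.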
I won't go into the details here of how to build the data structure, but hopefully you can get a flavour of what it might look like from the pseudocode for a traversal function:

function parse(node)
begin
  if null(node) then
    return false
  else if terminal(type_of(node)) then
    if current(token_of(node)) then
      return true
    else
      return false
  else if disjunct(type_of(node)) then // for | symbols
    if parse(head_of(node)) then
      return true
    else
      return parse(tail_of(node))
  else if parse(head_of(node)) then
    return parse(tail_of(node))
  else
    return false
end

Parsers using implicit grammars normally run faster than those using explicit ones, but for the size of parser we're talking about here, that's no big deal. Explicit grammars are easier to alter and require less programming; furthermore, the data structure for the grammar has a nice symmetry with the data structure that the parser will build (a parse tree). However, on the whole the implicit approach is probably the better one, because the error messages it allows can be more pinpointed and informative — plus it's easier to force it to accept something non-grammatical (warning: whatever grammar you use, it's never perfect!).

So far, I've discussed only about a third of the algorithm you need to write the parsing module. You know what a grammar looks like and you have a few clues as to how to program a recogniser for a grammar. Still to come is how to interface the tokeniser to the parser and how to obtain the information from the parse tree that you need to pass on to the binder. I'll start on those next time.
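As a parting sketch, here is one possible Python rendering of the explicit approach: the grammar lives in a data structure, and a generic function walks it. The node shapes (tuples tagged 'term', 'alt', and 'seq') and the saved-position retreat for disjuncts are my own assumptions, filling in what the traversal pseudocode omits.

```python
# Explicit-grammar traversal. Grammar nodes are tuples:
#   ('term', token)       - match a single token
#   ('alt', left, right)  - disjunction (the | symbol)
#   ('seq', head, tail)   - head followed by tail; tail may be None (empty)
class Walker:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def parse(self, node):
        kind = node[0]
        if kind == 'term':
            if self.pos < len(self.tokens) and self.tokens[self.pos] == node[1]:
                self.pos += 1
                return True
            return False
        if kind == 'alt':
            saved = self.pos          # retreat-to-previous-token, for when the
            if self.parse(node[1]):   # first branch matches only partially
                return True
            self.pos = saved
            return self.parse(node[2])
        # 'seq': parse the head, then the tail (an empty tail always succeeds)
        if not self.parse(node[1]):
            return False
        return node[2] is None or self.parse(node[2])

# The toy grammar:  <b> ::= y | z   and   <a> ::= <b> x <b> | <b>
B = ('alt', ('term', 'y'), ('term', 'z'))
A = ('alt', ('seq', B, ('seq', ('term', 'x'), ('seq', B, None))), B)

def recognise(grammar, tokens):
    w = Walker(tokens)
    return w.parse(grammar) and w.pos == len(tokens)
```

Note the symmetry mentioned above: the tuples describing the grammar look very much like the parse tree such a parser would eventually build.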