Algorithms for Parsing

by Richard Bartle

In my previous article, I derived a grammar for a subset of English that would be suitable for use in a MUD. This time, I'm going to discuss how a parser for it might be implemented. I'm afraid you won't be seeing any large pieces of C or C++ or whatever your language of choice happens to be: my aim is to give you enough information that you can have a crack at writing a parser yourself; it isn't to info-dump a working parser on you and then expand on the comments in the code while deluding myself that anyone is following what's going on. There'll be pseudocode, but nothing bordering on the executable.

Top-Down versus Bottom-Up

There are essentially two approaches to parsing: top-down and bottom-up. A top-down approach uses the philosophy of "know what you want and see what you've got", whereas a bottom-up approach prefers "know what you've got and see what you want". I'll illustrate this with a simple example. Apply the following rules to the input y x z:

<a> ::= <b> x <b> | <b>
<b> ::= y | z
A top-down approach says: "I want an <a>, which starts with a <b>. A <b> starts with a y or a z. Is the first character of the sentence a y or a z? If so, I have a <b>, so I can look at the next character. If there isn't one, I've found an <a>. Otherwise, the next character had better be an x and be followed by a <b>."

A bottom-up approach says: "I have a y. Are there any rules that start with a y? Yes, the one for <b>. OK, that's done the <b>, what's next? An x. There aren't any rules that start with y then x. Are there any rules that start with <b> then x? Yes, the first one, but it wants another <b> after the x. I'll bear that in mind while I look at the next character. Aha, it's a z, which is a <b>, therefore overall I have an <a>."

In general, bottom-up is faster but gives poorer error messages; people would rather be told by the parser, "This is what I want but I don't know what I've got" than, "This is what I've got but I don't know what I want". Compilers of programming languages tend to mix the two approaches, preferring bottom-up for logical and arithmetic expressions but top-down for everything else.

For a MUD command parser, you don't need to worry about mixing top-down and bottom-up, as the grammar is small and the parser doesn't really get to do a lot of work most of the time. I recommend using a top-down parser because it's easier to give a decent error message when people type in something non-grammatical; it's also easier to tinker with when you want to write a "special case" to handle something that the players keep on trying but that doesn't fit your grammar. If you prefer to write a bottom-up parser, that's fair enough; I'll be describing the top-down approach here, though.

There are, incidentally, tools which will automatically write you a lexical analyser and a recogniser (i.e. a parser that will tell you whether an input line is grammatical or not, but won't build a data structure).
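By way of illustration, here is one way the bottom-up monologue might look as real code: a Python sketch of a shift-reduce recogniser, with the stack-based mechanics and all the names my own invention, and the two rules reconstructed from the traces above.

```python
# A minimal shift-reduce (bottom-up) recogniser for the toy grammar:
#   <a> ::= <b> x <b> | <b>
#   <b> ::= y | z
# Tokens arrive as a list of single-character strings.
def bottom_up_recognise(tokens):
    stack = []
    for token in tokens:
        stack.append(token)                    # shift the next token
        if stack[-1] in ('y', 'z'):            # reduce: y | z  ->  <b>
            stack[-1] = '<b>'
        if stack[-3:] == ['<b>', 'x', '<b>']:  # reduce: <b> x <b>  ->  <a>
            stack[-3:] = ['<a>']
    if stack == ['<b>']:                       # at end of input, a lone <b> is an <a>
        stack = ['<a>']
    return stack == ['<a>']
```

Run on y x z, it reduces the y to a <b>, shifts the x, reduces the z to a <b>, then collapses the lot to an <a>, just as in the monologue.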
The best-known ones are lex and yacc, which come with the Unix C development system. Feel free to use these if you want, although it might be a good idea if you knew the general principles involved first so you can make an informed choice.

Implicit versus Explicit

The next decision you have to make is whether to use an implicit grammar or an explicit one. In the implicit approach, you code each BNF rule as a function in a program (your parser). A recogniser for the tiny grammar I gave earlier would look something like the following (in pseudocode); it assumes a function current(token), which returns true if the token the parser is currently looking at matches its parameter:

function parse_a()
begin
  if parse_b() then
    if current(token_x) then
      if parse_b() then
        return true
      else
        return error("<b> expected after x.")
    else
      return true
  else
    return error("<b> expected to begin <a>.")
end

function parse_b()
begin
  if current(token_y) or current(token_z) then
    return true
  else
    return false
end

You can see how these things can easily be computer-generated... For clarity, I've omitted the advance-to-next-token management, but that could be either encoded as part of current(token) or placed inline directly after a successful call to current(token). Similarly, code to retreat-to-previous-token has also been left out.

The explicit approach builds a data structure to represent the grammar and uses a generic function to traverse it.
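Filling in the token management that the pseudocode leaves out, a runnable version of the implicit approach might look like this in Python. It's a sketch under my own assumptions: a cursor index stands in for the advance-to-next-token code, and the error() calls are replaced by plain false returns with the messages left as comments.

```python
# Recursive-descent (top-down, implicit-grammar) recogniser for:
#   <a> ::= <b> x <b> | <b>
#   <b> ::= y | z
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0  # cursor: the advance-to-next-token management

    def current(self, token):
        # True if the token under the cursor matches; advances on success.
        if self.pos < len(self.tokens) and self.tokens[self.pos] == token:
            self.pos += 1
            return True
        return False

    def parse_a(self):
        if not self.parse_b():
            return False           # "<b> expected to begin <a>."
        if self.current('x'):
            return self.parse_b()  # False here means "<b> expected after x."
        return True

    def parse_b(self):
        return self.current('y') or self.current('z')

def recognise(tokens):
    p = Parser(tokens)
    # grammatical only if <a> parses and the whole input was consumed
    return p.parse_a() and p.pos == len(tokens)
```

The final check that the whole input was consumed matters: without it, y z would be accepted because a lone y is a perfectly good <a> on its own.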
I won't go into the details here of how to build the data structure, but hopefully you can get a flavour of what it might look like from the pseudocode for a traversal function:

function parse(node)
begin
  if null(node) then
    return false
  else if terminal(type_of(node)) then
    if current(token_of(node)) then
      return true
    else
      return false
  else if disjunct(type_of(node)) then // for | symbols
    if parse(head_of(node)) then
      return true
    else
      return parse(tail_of(node))
  else if parse(head_of(node)) then
    return parse(tail_of(node))
  else
    return false
end

Parsers using implicit grammars normally run faster than those using explicit ones, but for the size of parser we're talking about here, that's no big deal. Explicit grammars are easier to alter and require less programming; furthermore, the data structure for the grammar has a nice symmetry with the data structure that the parser will build (a parse tree). However, on the whole the implicit approach is probably the better one, because the error messages it allows can be more pinpointed and informative — plus it's easier to force it to accept something non-grammatical (warning: whatever grammar you use, it's never perfect!).

So far, I've discussed only about a third of the algorithm you need to write the parsing module. You know what a grammar looks like and you have a few clues as to how to program a recogniser for a grammar. Still to come is how to interface the tokeniser to the parser and how to obtain the information from the parse tree that you need to pass on to the binder. I'll start on those next time.
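As a parting sketch, here is one possible Python rendering of the explicit approach: the grammar lives in a data structure, and a generic function walks it. The node shapes (tuples tagged 'term', 'alt', and 'seq') and the saved-position retreat for disjuncts are my own assumptions, filling in what the traversal pseudocode omits.

```python
# Explicit-grammar traversal. Grammar nodes are tuples:
#   ('term', token)       - match a single token
#   ('alt', left, right)  - disjunction (the | symbol)
#   ('seq', head, tail)   - head followed by tail; tail may be None (empty)
class Walker:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def parse(self, node):
        kind = node[0]
        if kind == 'term':
            if self.pos < len(self.tokens) and self.tokens[self.pos] == node[1]:
                self.pos += 1
                return True
            return False
        if kind == 'alt':
            saved = self.pos          # retreat-to-previous-token, for when the
            if self.parse(node[1]):   # first branch matches only partially
                return True
            self.pos = saved
            return self.parse(node[2])
        # 'seq': parse the head, then the tail (an empty tail always succeeds)
        if not self.parse(node[1]):
            return False
        return node[2] is None or self.parse(node[2])

# The toy grammar:  <b> ::= y | z   and   <a> ::= <b> x <b> | <b>
B = ('alt', ('term', 'y'), ('term', 'z'))
A = ('alt', ('seq', B, ('seq', ('term', 'x'), ('seq', B, None))), B)

def recognise(grammar, tokens):
    w = Walker(tokens)
    return w.parse(grammar) and w.pos == len(tokens)
```

Note the symmetry mentioned above: the tuples describing the grammar look very much like the parse tree such a parser would eventually build.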