Notes from the Dawn of Time

Notes from the Dawn of Time #13:

Parsing

by Richard Bartle
February 27, 2002

It’s finally time for the interesting part: how the parsing process works. I’ll explain this informally, first, talking about implementation later.

The basic idea is to split an input line into self-contained, grammatical chunks until you reach a level where the chunks are individual tokens. While so doing, you build a structure to represent an input; from this structure you can then read the data required to determine the actual commands that are to be executed (or, more properly, for the final, binder phase to determine them). I won’t worry about the data structure part just yet; rather, I’ll concentrate on the way input lines can be broken down.

You’ll notice I keep saying "input lines" and not "sentences". This is because people can type lines that contain several sentences: E. OPEN DOOR. IN. CLOSE DOOR. KILL ZOMBIE WITH AXE. The first thing to do, therefore, is to partition the input line up into individual sentences. That’s fairly easy, as sentences can only end in full stops, question marks, exclamation marks and end-of-lines (again, I apologize to American readers for my use of standard English terms here).

So now we have sentences, but we still don’t have commands. Commands are unitary actions — they have one verb, and one verb only. Sentences can have several verbs: OPEN DOOR THEN THROW AXE THROUGH DOOR THEN CLOSE DOOR has three (OPEN, THROW, CLOSE). Such a sentence needs to be split into commands (or verb phrases) at its conjunctions.

Whereas with punctuation there was no ambiguity, with conjunctions there is. The word "but", for example, can function as a conjunction (GO WEST BUT STAY ON THE PATH), a preposition (DROP ALL BUT THE GOLD) or archaically as an adverb (OPEN THE DOOR BUT SLOWLY). OK, well, never mind, let’s assume that we’ll figure out some way of dealing with that later and carry on.

Commands

We have now reached the level where we’re looking at commands. At this point, it’s a good idea to write out a long command and see what we might want to do with it. I’ll go with QUICKLY OPEN THE LITTLE CHEST ON THE ROUND TABLE USING MY SMALLEST SKELETON KEY.

OK, well the first thing to notice is that the adverb could equally well go after the verb (OPEN QUICKLY), before the main preposition (QUICKLY USING) and at the end (KEY QUICKLY). With the adverb removed, the structure is basically a verb

followed by a noun phrase followed by a preposition followed by another noun phrase. In other words, you want to do something, you want to do it to something, and you want to use something else to affect how you do it. Note that we can shuffle the modules around a little: <preposition> <noun phrase> <verb> <noun phrase> would work, as in USING MY SMALLEST SKELETON KEY QUICKLY OPEN THE LITTLE CHEST ON THE ROUND TABLE.

Aside: although a maximum of two noun phrases is ample for most MUDs, in theory you can allow more. GIVE THE COAL TO THE TROLL USING THE TONGS, for example, has three (THE COAL, THE TROLL and THE TONGS). The problem with this comes later, during the execution phase: the second and subsequent noun phrases can appear in any order (GIVE THE COAL USING THE TONGS TO THE TROLL) and this is something of a bother when it comes to matching function calls to definitions.

Anyway, pressing on, we’re now at the point of being able to parse as far as the noun phrase level. So what is a noun phrase?

A noun phrase is a form of words that identifies an object or a set of objects. There are two in the example sentence: THE LITTLE CHEST ON THE ROUND TABLE and MY SMALLEST KEY. You’ll notice that the first of these itself contains two noun phrases, separated by — eek! — a preposition. Let’s call the smaller versions noun groups. A noun group looks like it can have an article first (but it doesn’t have to) then any number of adjectives (and I guess we could put integers here, too) then a noun (or presumably a pronoun). At this point, we’re down to the level of individual words, so we can’t do any more.

A Grammar

Believe it or not, using this informal approach we’ve actually achieved quite a lot! We’ve defined a grammar for a line of input:

An input is a sentence, optionally followed by a full stop (or question mark or exclamation mark) and another input, eventually terminating in an end-of-line.
A sentence is a command, optionally followed by a conjunction and another sentence.
A command is a verb, optionally followed by a noun phrase, which, if present, is optionally followed by a preposition and another noun phrase. Alternatively, it’s a preposition followed by a noun phrase, then a verb, then another noun phrase.
A noun phrase is a noun group optionally followed by a preposition and another noun phrase.
A noun group is an optional article followed by an optional integer followed by any number of adjectives followed by either a noun, a pronoun or a string.

You’ll have spotted that this is recursive, in that a definition can contain a reference to itself. If you know about these things, you may also have noticed that it’s right-recursive (or tail recursive), in that the recursive reference is on the right of a definition (X::= Y z X) rather than on the left (X::= X z Y); this is because expressing the idea using right recursion makes implementing such a grammar easier than using left recursion (you don’t get stuck in recursive loops!).

Yes, I know, I missed the adverbs out. The reason for that is because it was just too tortuous to explain where they could go. This happens often in describing grammars, so computer scientists have adopted a standard notation to help them. This is BNF (Backus-Naur Form), and it’s been in use since 1960.

A BNF Grammar

BNF isn’t hard to follow, especially as I’ve already been introducing it surreptitiously during these articles...

Everything in BNF is either a category (in which case it appears in <angle brackets>) or a terminal symbol (in which case it doesn’t). Categories are things like <noun phrase> and <sentence>, that you want to explain in more detail; terminal symbols are things like noun and verb, that you don’t (although you could, by enumerating all the nouns and verbs you were planning on accepting — not usually a good idea!). Every category used appears on the left hand of a ::= symbol in a rule that describes the ways by which it can be composed. The | symbol is used to mean "or" for alternative definitions, and it can call on (round brackets) to show exactly what is being ored with what.

Additionally, anything in [square brackets] is optional, and anything in {curly brackets} can be repeated any number of times (including 0). That’s all there is to it!

Here, then, is the BNF version of the grammar I gave above, with the inclusion of adverbs (although I’ll mainly miss these out again when it comes to giving implementation examples, as they get in the way somewhat). I’ve also replaced some of the recursive definitions by iterative (repetitive) ones, as right-recursion allows this - the two are equivalent:

<input> ::= <sentence> { end_of_sentence <sentence> } end_of_line
<sentence> ::= <command> { conjunction <command> }
<command> ::= { adverb } verb { adverb }
                [ <noun phrase> { adverb }
                    [ preposition <noun phrase> { adverb } ]
                ]
            |
                { adverb } preposition <noun phrase> { adverb } verb
                    { adverb } <noun phrase> { adverb }
<noun phrase> ::= <noun group> { preposition <noun group> }
<noun group> ::= [ article ] [ integer ] { adjective } ( noun | pronoun | string)

This is a fairly short but surprisingly powerful grammar that describes something akin to a subset of English that you could use for a MUD. I say "akin to a subset" because it accepts things that aren’t English such as OPEN DOOR (as an imperative command, English requires an article — OPEN THE DOOR) or GET RED "HELP" (English doesn’t apply adjectives to strings). As features for MUDs, these beyond-English extras are either good or neutral; they’re never problematical. If players like them, they’ll use them; if they don’t then they won’t.

Other grammars are possible, of course, for more complicated pseudo-English languages. MUD2’s goes into more detail over the parts of speech, so for example it distinguishes between relative pronouns and simple pronouns and it can handle possessives (GET FRED’S SWORD); basically, though, it’s the same as the above. Obviously, you can tinker with a grammar as much as you like, so long as you remember that it’s you who’s going to have to implement it.

Ah, yes, implementation. I’ll begin to describe how to go about that next time...

Recent Discussions on Notes from the Dawn of Time: