Notes from the Dawn of Time

Notes from the Dawn of Time #18:

The Parse Tree

by Richard Bartle
May 8, 2002

So you’ve stayed up until the small hours, night after night, and you finally have yourself a parser. Players can type in complicated imperative sentences and your program will accept them or generate a stream of expletives. How do you use it, though?

Well, what does it produce? It’s fairly trivial to get it to record what parts of speech were assigned to a token – you just write the successful ones into a separate array during the parse() stack unwinding. You can do more, though: you can create a parse tree.

A parse tree is a data structure that represents not only what the tokens in an input line mean, but how they were generated. It’s where the earlier decision of what to make into the “rules” of your BNF grammar finally becomes important.

Here’s an outline example of what a parse tree looks like:

                   <input line>
                           |
                   <sentence>
                           |
                   <command>
                           |
                   <command 2>
                           |
+-----------------+--------+--------+-----------------+
|                 |                 |                 |
|           <noun phrase>           |           <noun phrase>
|                 |                 |                 |
|           <noun group>            |           <noun group>
|                 |                 |                 |
|           +-----+-----+           |           +-----+-----+
|           |           |           |           |           |
WITH        THE         DIAMOND     RING        THE         BELL
preposition article     noun        verb        article     noun

WITH THE DIAMOND RING THE BELL
preposition article noun verb article noun

From this data structure (which you can build up piecemeal on the unwind of the recursion – a tiresome but not time-consuming exercise) it’s possible to identify not only what the individual words refer to, but what the individual rules do. Thus, if you want to find out what the second noun group of a command is, you really can just “read it off” the data structure.

If you prefer a brute-force approach, then you don’t strictly need the data structure – you can rip elements out of the token array by relying on what you know about your grammar (e.g. anything between a preposition and either a verb, preposition or end of sentence is a noun phrase – if you ignore the adverbs and hack your way round conjunctions). In the case where your parser is only partially backtracking, like MUD2’s, you’ll pretty well have to do it this way.

Making meaning

So it’s possible to group tokens together into meaningful grammatical units. Now you have to figure out what they mean in game terms.

This is where the vocabulary comes in. I described the vocabulary way back in column 11 as a set of triples: <word, part of speech, atom>. Tokenisation used the first two of these – it converted the words that players entered into tokens. Now we use the second two – we convert the tokens into atoms. Atoms are things the game can use; they refer unambiguously to a single concept (although not necessarily to a single referent of that concept, as we shall see when we discuss the binder).

MUDs are focussed on commands. These are unitary instructions telling the game to do something. The parse tree lets us see how the input line was split into sentences, and how sentences were split into commands, so we know what each individual command consists of in terms of tokens (and therefore atoms). How does that give us a handle on what the sentence means, though?

OK, this next bit comes from AI computational linguistics. It’s is pretty obvious anyway, so I’m just going to say it and not explain the rationale...

The verb of a command is the function in a function call. The first noun group (if it has one) is the call’s first parameter; the second noun group (if it has one) is its second parameter. Adverbs and prepositions are functions that modify the verb; adjectives are functions that modify what the noun refers to; pronouns are dynamically assigned nouns.

So what you need to produce from your parse tree is a data structure of atoms that reflect this information, and which is implemented in a form directly acceptable by your game engine. How you decide to do this is up to you, but for the sake of simplicity I’ll assume it to be a list with nested sublists:

[ verb [verb modifiers] [noun phrases] ]

The verb modifiers are just the adverbs and prepositions that the command contains. The noun phrases themselves have their own particular format:

[ noun [noun modifiers] ]

where the noun modifiers are the adjectives and qualifying noun phrases associated with the noun.

Examples

Here’s are a couple of examples I’ve used before in this series of articles:

WITH THE DIAMOND RING THE BELL
[ ring, [with], [ [diamond, [the]], [bell, [the]] ] ]

TAKE THE GREEN APPLE FROM THE BOX THEN HIT IT WITH MY SWORD
[take, [from], [ [apple, [the, green]], [box, [the]] ] ] [hit, [with], [ [it, []], [sword, [my]] ] ]
Here’s a big, meaty example to show you the kind of thing you’ll be able to boast your parser can handle. I’ll write out its list in a more structured fashion so you can see how it’s made up without having to count brackets:

VERY QUICKLY DROP THE TREASURE EXCEPT THE EMERALD AND THE SMALL WHITE STONE INSIDE THE OPEN MUSIC BOX SECRETIVELY

[   drop,
    [inside, very, quickly, secretively],
    [   [treasure,
            [the,
                [except,
                    [and,
    [   [emerald,
                            [the]
                            ],
                            [stone,
                                [the, small, white]
                            ]
                        ]
                    ]
                ]
            ]
        ],
        [box,
[the, open, music]
        ]
    ],
]

This list is the result of the parsing module, and it’s passed on to the binder. The binder has to take the list, and use its own knowledge of the game world’s state to derive individual function calls. You don’t have to write the binder in your game world’s definition language, but it’s a darned sight easier if you do (MUD2’s is written in MUDDLE like the rest of the game – now; I learned the hard way...). The binder is the final step of the whole parsing process, but it can be among the most tortuous and frustrating of them all because there are so many special cases involved.

I’ll begin my discussion of the binder next time...

Recent Discussions on Notes from the Dawn of Time: