Trials, Triumphs & Trivialities #130:
August 14, 2003 - More than once in this column, I've talked about administration (see, for example, #68 or #96). Administration is a big deal, perhaps the biggest deal in the design of any game. Recent books on the subject, such as Jessica Mulligan's Developing Online Games, claim that the post-release work can be up to 90% of the total workload of an online game.
What recent books perhaps don't lay out is the sheer terror that can go along with post-release fixes, and most specifically the terror that attends the occasional necessity to make code changes to live servers that players are currently using.
Why Live Updates? Some Examples.
At first the idea of live updates might seem insane. Unfortunately, they're often required in an always-on environment. Last week I had to make live updates to two online communities (one was a game, while the other was a forum), and I think they exemplified well the need for this sort of thing.
My game change was the more nerve-wracking one. When Galactic Emperor: Hegemony was originally coded, its lock-file methodology was set up in such a way as to accidentally allow a rare race condition. Or, in non-Greek: it was possible for two people to try to change the state of the game at the same time if they happened to hit the game within milliseconds (nanoseconds?) of each other. Caution led me to avoid changing the lock-file methodology for a long time, but then this week someone's play style exacerbated the bug. I don't know why; perhaps they had a style of using two windows at once, or perhaps there was just a weird lag in their dial-up line. But, whatever the case, there was a game of Hegemony that got wiped out three times in a day on Monday. Thus I needed to make a very major change, and I had almost no way of testing out the change beforehand, because the error typically only came up every couple of million game hits.
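A race like the one described above can come from checking for a lock file and then creating it in two separate steps. Hegemony's actual locking code isn't shown in this column, so what follows is only a minimal Python sketch of one common fix, assuming a single local filesystem: ask the operating system to create the file atomically with O_CREAT | O_EXCL. The names game_lock, LockHeld, and demo are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


class LockHeld(Exception):
    """Raised when another request already holds the lock (hypothetical name)."""


@contextmanager
def game_lock(lock_path):
    # O_CREAT | O_EXCL turns "check that the file is absent" and "create it"
    # into one atomic step, so two requests arriving at the same instant
    # cannot both succeed. A separate exists-then-create check leaves a gap
    # between the two steps, and that gap is where a race can slip in.
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockHeld(lock_path) from None
    try:
        # Record the owner's process id, which helps when debugging stale locks.
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        yield
    finally:
        os.remove(lock_path)


def demo():
    # Simulate a second request arriving while the first still holds the lock.
    lock_path = os.path.join(tempfile.mkdtemp(), "game.lock")
    with game_lock(lock_path):
        try:
            with game_lock(lock_path):
                second_got_in = True
        except LockHeld:
            second_got_in = False
    # Report whether the second request got in, and whether the lock remains.
    return second_got_in, os.path.exists(lock_path)
```

This sketch does not deal with stale locks left behind by a crashed process; a real system would also need some policy for those.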
My forum change was less demanding, but also more visible. I spent part of an afternoon mucking around with the HTML templates at the RPGnet forums to make them better match the rest of the site. Unfortunately the VBulletin that we use there doesn't really have a very good HTML previewing system, so I made changes and on occasion the forums on RPGnet looked really ugly for a minute or two. No loss of functionality, but not my preference either.
Finally, to rip an example from the headlines, consider the newest Microsoft worm that's running around due to bad security programming in Windows. Would you, as a potentially affected user, rather wait a month or two while the evil programmers at Microsoft stress-test their newest patch, or would you rather have a patch now as long as it's been tested at least a bit?
Why Live Update? The Reasoning.
As an administrator, I believe you're going to have to make a constant cost-benefit analysis as to how raw your code can be when you release it to your live servers.
On the cost side you have to calculate not just how much time or resources it would take you to more extensively test a system, but also the potential harm it would do to a user if there was a problem with the raw code.
On the benefit side you have to calculate how much your players/community-members will benefit from a patch that's released more quickly.
When you put all those together, you should have a fairly compelling answer one direction or another.
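One possible way to make that comparison concrete is sketched below in Python. The function name release_now, the unitless scores, and the idea of weighting harm by a guessed probability are all assumptions made for illustration, not a formula from this column.

```python
def release_now(extra_testing_cost, speed_benefit, bug_probability, harm_if_bug):
    """Return True if shipping now looks better than testing further.

    All inputs are rough, unitless estimates (hypothetical scale).
    """
    # Expected harm of shipping raw code: chance of a problem times its damage.
    expected_harm = bug_probability * harm_if_bug
    # Holding the code back costs the extra testing effort plus the benefit
    # that users go without while they wait.
    cost_of_waiting = extra_testing_cost + speed_benefit
    return cost_of_waiting > expected_harm


# A cosmetic change: testing is awkward and a mistake does little damage.
cosmetic = release_now(extra_testing_cost=2, speed_benefit=1,
                       bug_probability=0.5, harm_if_bug=1)

# A risky change: a mistake could destroy user data.
risky = release_now(extra_testing_cost=1, speed_benefit=1,
                    bug_probability=0.2, harm_if_bug=100)
```

The numbers are invented; the point is only that cheap testing and high potential harm push toward waiting, while costly testing and low potential harm push toward releasing.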
Looking at my recent Hegemony escapades, I'd lay things out as follows:
Because I was very confident in my new algorithms, having researched them quite thoroughly, I went ahead and pushed the system out. And, a few days later, things look good. Whew.
Running the same analysis of my RPGnet changes:
If I'd had more time at RPGnet I probably should have looked into a testing-only VBulletin instance, but given that the potential cost to users was so low, there probably wasn't much harm in going ahead and pushing the changes out.
As an administrator you should always test your code as much as you possibly can before releasing it into the wild. But, eventually, the costs for additional testing will grow high enough that you have to bite the bullet and release. This could be after months of strenuous testing or moments after you've written the code and pushed it through a couple of paces. Just be aware that there are options here, and saying that you should never update live code without X months of testing is as dumb as saying you always should.
Alleviating Live Updates.
So, you're getting ready to introduce some partially raw code to your users. What can you do to help alleviate that, so it's not either a gameplay or public relations disaster? Here are a few ideas:
Announce in Advance: Let people know in advance what you're going to do and why. Not being surprised in and of itself will probably please your users. They'll also be a lot quicker to understand if there is some problem.
Enforce a Downtime: Sometimes, if your live updates will be particularly disruptive as you put them in place, you may want to enforce a (pre-announced) downtime. Don't let any of your users into your service for a set amount of time, so that you can make sure that your new functions are all working correctly on the live environment before your players see the chaos that you've wrought.
Backup Your Code: Make sure that you have a copy of your old code so that you can roll back at a moment's notice, if need be. For the 35 or so distinct changes that I've made to Hegemony since its release, I've only rolled code back once or twice. But, let me assure you, it was very nice to be able to do that rollback and then solve the revealed problems at my leisure, rather than while players were breaking down the doors.
Backup Your Data: You should already be backing up your user/player data. However, if you're putting in a really big code change, you might want to increase the frequency of that backup to minimize any damage done by problems. For example, as part of my major Hegemony code change this week, I upped data backup from daily to hourly. I'm going to need to back off of that soon, as it's just eating up storage space, but it's been really nice to have that cushion, and feel confident that any damage would be minimal.
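The two backup ideas above can be sketched together in Python. This is a minimal illustration, not the setup actually used for Hegemony: the function names snapshot, rollback, and demo, the directory layout, and the default of keeping 24 copies (one day of hourly backups) are all assumptions. Pruning old snapshots is one way to keep frequent backups from eating up storage space.

```python
import shutil
import tempfile
import time
from pathlib import Path


def snapshot(source_dir, backup_root, keep=24):
    """Copy source_dir into a new numbered folder, keeping only the newest `keep`."""
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    existing = sorted(p.name for p in backup_root.iterdir() if p.is_dir())
    # A zero-padded sequence number keeps folder names sorted oldest to newest.
    next_id = int(existing[-1].split("-")[0]) + 1 if existing else 1
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{next_id:06d}-{stamp}"
    shutil.copytree(source_dir, target)
    # Drop the oldest snapshots beyond the retention limit.
    for old_name in (existing + [target.name])[:-keep]:
        shutil.rmtree(backup_root / old_name)
    return target


def rollback(snapshot_dir, live_dir):
    """Replace live_dir with the contents of a saved snapshot (not atomic)."""
    live_dir = Path(live_dir)
    if live_dir.exists():
        shutil.rmtree(live_dir)
    shutil.copytree(snapshot_dir, live_dir)


def demo():
    root = Path(tempfile.mkdtemp())
    live, backups = root / "live", root / "backups"
    live.mkdir()
    (live / "game.py").write_text("old code")
    saved = snapshot(live, backups, keep=2)    # taken just before the update
    (live / "game.py").write_text("new code")  # the live update itself
    rollback(saved, live)                      # something broke, so restore
    restored = (live / "game.py").read_text()
    snapshot(live, backups, keep=2)
    snapshot(live, backups, keep=2)            # the third snapshot prunes the first
    return restored, len(list(backups.iterdir()))
```

The same snapshot function could be pointed at a data directory and run from a scheduler at whatever interval the current level of risk calls for.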
You're never going to be able to test code as much as you'd like, and thus at some point or another you're going to need to put it on a live server. Costly testing or huge benefit might require you to do that push much earlier than you're comfortable with. That's OK, as long as you've carefully measured what the cost and benefit are, and have also considered various ways to alleviate problems.
Nerve-wracking? Yeah, but sometimes necessary.
One final announcement before I close up this week's column: because of increased busyness at the business end of Skotos, my sanity has kindly requested that I relax a bit, and so I've decided to push this column back to biweekly, as I have on occasion before during its run. So, watch for me back here in two weeks, rather than the normal one.
On the bright side, I hope that this change will let me feel a bit more confident about covering some of the more difficult topics I've been avoiding lately due to the research and analysis that they require. I hope to change that in two weeks, with my delayed consideration of the history of RPGs.