Collective Choice: More Thoughts About Ratings
by Shannon Appelcline
It's now been several months since we launched the RPGnet Gaming Index, and it's generally provided an excellent petri dish for a lot of the collective choice ideas that I've discussed here in TT&T and with Christopher at Life with Alacrity.
Here's a quick overview of seven of the lessons that I've learned, without any attempts (yet) to turn it into a more cohesive whole:
1. Drive-By Ratings Are An Increasing Problem on the Net
Once upon a time on the 'net you could have a poll on your site and expect it to provide a good sampling of what the people at your site thought. Increasingly, however, various groups on the 'net are on the lookout for these polls and actively mobilize large groups of people to go weight the poll in their direction. Political polls are the most heavily affected, with the result that they've become almost worthless as a tool for measuring actual public viewpoints.
It's the spam of the collective choice world.
For the RPGnet Gaming Index we've seen the problem in miniature. There isn't such organized activism pushing people in our direction, but nonetheless individual publishers have started linking to their game entries and asking their people to rate that game--and they do.
Now we knew this would be a problem from the start, so I built a trust-based mechanism to measure whether someone was a good rater or not, as I wrote in Who Do You Trust?. The idea was simple: give each rater a weight, within a specific range, which was directly based on the number of ratings they'd made.
Most drive-by raters stop by, and just rate the one item, or maybe a few more from the publisher, and so this initial solution solved 90% of the problem. I've since adjusted the formula slightly to totally ignore raters who just offer one or two ratings, which probably upped the success rate to 99%, but I'm still concerned that consistent direction of raters to the site could influence the results in the long-term.
If we hadn't offered solutions for this problem, I think our results would be wildly skewed. Out of 672 distinct raters in the index to date, only 336 (exactly 50%) have rated more than 2 items. Further, of the top three games affected by drive-bys, 38%-46% of the ratings were drive-by ratings that are now being ignored.
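A minimal sketch of this kind of count-based weighting (the cutoff and cap here are illustrative assumptions, not the Index's actual formula):

```python
def rater_weight(num_ratings, min_ratings=3, cap=100):
    """Weight a rater by how many ratings they've made.

    Raters below `min_ratings` are ignored entirely (weight 0), which
    filters out most drive-by raters; everyone else earns a weight
    that grows with their rating count, up to a cap.
    """
    if num_ratings < min_ratings:
        return 0.0
    return min(num_ratings, cap) / cap
```

A drive-by rater who stops in to rate just one or two games thus contributes nothing at all to an item's weighted average.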
2. People Rate Less Than We Thought
Setting up our trust-based mechanic involved guessing at how many items a person would rate on average. Clearly this is going to vary from industry to industry, but for roleplaying games I guessed that the average "good" rater would rate 100 items and a "great" rater would rate 200.
A few months in, looking over the numbers, I realized that we'd guessed high, and people rated less than we expected, which resulted in a small number of "experts" providing much of the weight of the ratings in the Index. Since then I've dropped the numbers to 50 being a "good" rater and 100 being "great".
The original numbers marked 1.4% of raters as great and 2.8% as good. The new numbers change those to 2.8% great and 6.1% good. (Of course a much larger percentage of the actual ratings are marked at these levels, because these are the raters who have rated a lot.) Our trust-based metric still creates a panel of experts, but it's now a larger one, and my general belief is that these percentages will slowly increase as the Index grows, since people who might right now be sitting at 48 or 46 (or even 20) ratings will return again and again to rate more items.
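The revised tiers can be sketched like so; the thresholds come from the text, but the labels and function itself are just illustrative:

```python
def rater_tier(num_ratings, good=50, great=100):
    """Classify a rater under the revised thresholds: 50 ratings
    marks a "good" rater and 100 a "great" one (both halved from
    the original guesses of 100 and 200)."""
    if num_ratings >= great:
        return "great"
    if num_ratings >= good:
        return "good"
    return "typical"
```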
3. The More Someone Rates The Better Their Ratings Are
Our trust metric was built upon the belief that the quality of a rater's ratings increases with the number of ratings they've made. The only way we can objectively measure this is by looking at their spread of ratings, under the assumption that a good rater is more likely to move toward the average, while a less useful rater is more likely to rate only on the high end of the scale--providing less distinction between individual products.
This is borne out by the data thus far. The one- or two-timer raters who are currently ignored in ratings have an average rating of 8.75. The average rater, meanwhile, has a rating of 7.10. The weighted average, which includes our trust metric and thus leans toward the heavier raters, is 6.79. Before we halved our current trust metric, to take weight away from the experts, the difference between the unweighted average and the weighted average was .2 points more.
Pretty much any way we look at it, the more someone rates, the more their average rating approaches the center of the field, and thus the more distinction they provide between different items, and the better their set of ratings is.
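The weighted average driving these comparisons can be sketched as follows; the ratings and weights here are made up for illustration, not taken from the Index:

```python
def weighted_average(ratings_and_weights):
    """Average a set of ratings, scaling each one by its rater's
    trust weight. Pairs whose weight is 0 (e.g. drive-by raters)
    drop out entirely."""
    total = sum(r * w for r, w in ratings_and_weights)
    weight = sum(w for _, w in ratings_and_weights)
    return total / weight if weight else None

ratings = [(9.0, 0.0),   # drive-by rater: ignored
           (8.0, 0.2),   # casual rater
           (6.5, 1.0)]   # heavy rater pulls the result toward the middle

# weighted_average(ratings) gives 6.75, versus an unweighted
# mean of about 7.8 for the same three ratings.
```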
4. Ratings With Comments Are Better Than Those Without
Another of our trust metrics weights a rating more heavily when it has a comment than when it doesn't. Again using the objective measurement that "better" ratings are more average (on average), this seems to be working. The average rating with no comment is 7.2 and the average rating with a comment is 6.8. Looking at the weighted averages opens up this difference a little further: 7.0 for ratings without comments versus 6.4 for ratings with them.
Anecdotally, I've personally found this to be an entirely valid metric. If I write a comment it's because I'm more familiar with a book, and my rating can indeed be better trusted. If I don't it's usually because my recollection is fuzzier. (Not everyone will have the same reluctance to write a truly meaningless comment, but the statistical analysis suggests that enough do to make a difference.)
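A sketch of the comment bonus; the 1.5x multiplier is an assumption for illustration, since the article doesn't give the Index's actual figure:

```python
def rating_weight(base_weight, has_comment, comment_bonus=1.5):
    """Scale a rating's weight up when it carries a comment.
    The default 1.5x bonus is illustrative, not the Index's
    actual multiplier."""
    return base_weight * comment_bonus if has_comment else base_weight
```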
5. Reviews Aren't As Good as We Thought
The final chunk of our trust metric came from reviews. We gave full-text reviews (entered through a different system) a weight 25% higher than our most trusted ratings-with-comment and 150% higher than our most trusted ratings-without-comment. The theory was that if someone took the time to write a review, they had a really solid understanding of the product.
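Taking the most trusted rating-without-comment as the unit, the relative weights in that scheme work out like this:

```python
# Relative weights implied above, normalized so the most trusted
# rating-without-comment is 1.0. A review is 150% higher than that
# (2.5x) and 25% higher than a rating-with-comment (1.25x), which
# implies a rating-with-comment is worth twice a bare rating.
RATING_NO_COMMENT = 1.0
RATING_WITH_COMMENT = RATING_NO_COMMENT * 2.0
FULL_REVIEW = RATING_WITH_COMMENT * 1.25

assert FULL_REVIEW == RATING_NO_COMMENT * 2.5
```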
Though that's true, it didn't take into account the fact that the purposes for writing a review and entering a rating are very different. Someone takes just a minute to enter a rating, and does so to give an opinion back as a member of a community. Writing a review can take an hour or more, and is more often than not done to talk up a great product. As a result, reviews tend to skew high--which we already should have known from past analyses of our review database.
The average review, using our conversion formula, earns a 7.28, while our weighted database average is, as noted, 6.79. Perhaps that's not terrible--it's just a bit more than the difference between our weighted and unweighted average ratings--but we actually weight these reviews highly, and so there's the concern that we're giving extra weight to values that might be less reliable.
For now I'm punting on this issue. As of today reviews make up less than a third of the total weight of our ratings, and I expect that number to decrease over time, precisely because it's easier to rate than review. We've had ten years to create a database of reviews and half-a-year to create a database of ratings, so I suspect that even if the reviews are a little out-of-whack they'll disappear into noise as the database grows, and as I've said previously that's one of the real benefits of a large ratings database: you don't have to sweat the little stuff.
6. You Have to Think About Significant Digits
In the actual implementation of our database, one issue I had to consider was significant digits. In other words, how precise did we make a rating? As you may have guessed from the article thus far, I chose two decimal places (e.g., 8.58) for all ratings.
I really don't know how accurate we can say our ratings are, or even what one might mean by accuracy in a ratings system. Clearly we're using our trust metrics to try and produce a somewhat unbiased rating, but I don't know how unbiased. If we were polling I could roll out margins of error, but a rating isn't trying to predict a future outcome, it's just trying to say what's good and what's not.
On the flipside, clearly there's going to be a point where a number stops being meaningful and starts being random. I thus decided that the third decimal place was meaningless--just part of that noise that I talk about. I suspect the second decimal place might be too, at least with the 10,000 ratings that we currently have in the database, but I kept it in anyway.
This leads to a problem: with just two decimal places on a ten-point scale I have only 1000 possible values, from 0.01 to 10.00. Because we use Bayesian weighting to push things toward the center, and because people usually rate high, that range is shortened even more. Currently the ratings run from 3.56 (FATAL) to 8.58 (Delta Green: Countdown), for 503 possible values. Meanwhile we have 5219 distinct rateable entries in the database, meaning that on average roughly ten different entries share each rank in the index.
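That push toward the center works like a standard Bayesian average: an item's own ratings get blended with a global prior, so lightly rated items sit near the middle of the scale. The prior mean and prior weight below are illustrative assumptions, not the Index's actual parameters:

```python
def bayesian_average(ratings, prior_mean=7.0, prior_weight=5):
    """Blend an item's ratings with a global prior.

    An item with few ratings stays near `prior_mean`; as ratings
    accumulate, its own average comes to dominate.
    """
    n = len(ratings)
    return (prior_mean * prior_weight + sum(ratings)) / (prior_weight + n)
```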
It seems like an argument for adding a third decimal place, but again I'm just not convinced that would be meaningful. So instead we differentiate items at the exact same rating with a "popularity" metric: games with more weight of ratings are ranked higher, and for games with the same weight of ratings, the one with more pageviews is ranked higher. This might rank things in the wrong direction for the worst games, but the best games are the ones people are most interested in, and I think this measure of popularity is a much better method for saying which is "best" than a third decimal place that's probably entirely noise.
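That tie-breaking order can be sketched as a compound sort key (the field names and data here are hypothetical):

```python
def rank_key(item):
    """Sort key for the index: rating (to two decimal places) first,
    then total weight of ratings, then pageviews as the final
    tie-breaker."""
    return (round(item["rating"], 2), item["rating_weight"], item["pageviews"])

games = [
    {"name": "A", "rating": 8.58, "rating_weight": 40, "pageviews": 900},
    {"name": "B", "rating": 8.58, "rating_weight": 40, "pageviews": 1500},
    {"name": "C", "rating": 8.58, "rating_weight": 55, "pageviews": 200},
]
ranked = sorted(games, key=rank_key, reverse=True)
# C (more rating weight) outranks B, which outranks A (more pageviews),
# even though all three share the same two-decimal rating.
```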
(And I'll also note that the problem of same-value rankings isn't as large as I make it out to be, because the vast majority of entries sit in an unrated mass in the middle--which is entirely a topic for another day.)
7. Any Rating is Ultimately Arbitrary
With all that said, it's worth finishing up with the statement that any rating is ultimately arbitrary. As I've recounted here we use various trust metrics to try and give the "best" rating possible. But, by changing any of those metrics, the ranks within our index change. Not a lot--the best twenty-five games tend to stay at the top and the worst at the bottom--but they do change.
And thus a set of rating-based ranks is ultimately a result of a very particular set of individuals rating a very particular set of items at a certain place using certain rules. Change any of those and your results change.
So take ratings with a grain of salt.