Collective Choice: Ratings, Who Do You Trust?
by Shannon Appelcline
In the past Christopher and I have talked about collective choice and ratings. As we wrote at the time, "The best way to make your ratings statistically sound is with volume. If you can manage thousands or tens of thousands of ratings for each item, any anomalies are going to become noise."
However, you won't always be able to present rating systems that start off with (or will ever have) high volumes of ratings. In these cases, you have to consider new methods to make your ratings more accurate and useful. How do you do that? You answer the question, "Who do you trust?"
#1: Trust Ratee Volume
In our original article, Christopher and I suggested a first method to improve low-volume ratings, which we call "Bayesian weighting". Here, you weight an item's rating toward the norm, so that early ratings, which are likely to have a very high margin of error, can't swing that rating wildly.
The purpose of Bayesian weighting isn't to make the ratings more accurate (in fact it does exactly the opposite, by making the result less reflective of the data users actually enter). Instead, it makes the resultant rankings more accurate, which is to say the comparative ordering of distinct items in the same rated database. Without Bayesian weighting, an item with a single rating of 10 out of 10 would be ranked the best item ever. With Bayesian weighting, that single rating of 10 is mixed with a heavily weighted average, so it might move the item's rating from an average 6 to a 6.1 rather than all the way up to a 10.
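As a minimal sketch, here's the idea in code. The prior mean of 6.0 and the prior weight of 40 "phantom votes" are illustrative assumptions, not the actual parameters used by any particular rating database:

```python
def bayesian_rating(ratings, prior_mean=6.0, prior_weight=40):
    """Blend an item's raw ratings with a heavily weighted average.

    The prior acts like `prior_weight` phantom votes at the
    database-wide average, so sparsely rated items stay near the
    norm. (Both parameter values are illustrative assumptions.)
    """
    total = prior_mean * prior_weight + sum(ratings)
    count = prior_weight + len(ratings)
    return total / count

# A single rating of 10 barely moves the item off the 6.0 average:
print(round(bayesian_rating([10]), 2))  # roughly 6.1
```

As the volume of real ratings grows, they gradually drown out the prior, so a genuinely well-loved item can still climb toward the top of the rankings.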
This method is well based in statistical analysis of the margin of error, which simply says that the more ratings that a survey includes, the more likely it is to be accurate.
The simplest calculation for margin of error, assuming a (high) 99% confidence level, is:

margin of error = 1.29 / √n

Where "n" is the sample size.
Thus if you had 1 person input his rating, you'd have a margin of error of 129%, meaning that there is a 99% chance that the true rating lies anywhere from 129% below to 129% above the result. In other words, it's totally useless.
With a 100-person sample, the margin of error decreases to 12.9%. For a 10-point rating system that means that if an item is rated "7" (above average), it's really somewhere between "6" (average) and "8" (great).
You need 10,000 people to get that margin of error down to 1.29%.
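As a quick check of those figures, the 1.29/√n rule above is the standard conservative survey approximation z × 0.5/√n, with z ≈ 2.58 for 99% confidence:

```python
import math

def margin_of_error(n, z=2.58):
    """Conservative (worst-case) margin of error for a sample of n.

    Uses the common survey approximation z * 0.5 / sqrt(n); for
    z = 2.58 (99% confidence) this gives 1.29 / sqrt(n).
    """
    return z * 0.5 / math.sqrt(n)

for n in (1, 100, 10_000):
    print(n, f"{margin_of_error(n):.2%}")
# 1 person -> 129%, 100 people -> 12.9%, 10,000 people -> 1.29%
```

Note the square root: each extra digit of precision costs a hundredfold increase in sample size, which is exactly why low-volume databases need other sources of trust.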
A Bayesian weighting helps to offset this by keeping ratings with a higher margin of error toward the norm, thus preventing them from upsetting the highest or lowest rankings until their statistical basis is more sound.
It's a good start to making a low-volume set of ranks more trustworthy.
#2: Trust Rater Volume
Particularly on the Internet we run into another issue: drive-by ratings. Especially in the world of blogs and other user-driven content it's becoming increasingly likely that authors, designers, and publishers will provide links to rating systems on their blogs, and ask their fans to go use them.
All of the above discussion of margins of error assumed "random samples", and that assumption goes straight out the window when raters self-select as people who particularly like a certain book/game/item and then rate only that book/game/item. Thus we must begin to consider another type of volume.
Where before we discovered that high-volume ratings of items were more accurate, now we can consider if high-volume raters (who rate multiple items in the same ranked database) are more accurate. In our limited study of raters at the RPGnet Gaming Index, this indeed seems to be the case:
The users who have rated 55 or more entries have average ratings between 6.0 and 7.4. As we drop down in rating totals, the variance of the ratings increases. The first average of 8 or higher, an 8.7, appears for a user with 54 ratings, and numerous 8s appear after that. When we get down to single-digit rating totals, we find a 9.125, numerous other 9s, and then even some 10s.
Clearly, we'd need to study a larger rating database for this data to be truly meaningful, but thus far the information is suggestive. Especially at the edges, an individual user's ratings vary widely from the norm, and thus are more likely to corrupt a general database.
I've considered some methods to resolve this, including changing a user's actual ratings to bring them toward the norm. However, the idea of Bayesian weights suggests another answer: give each user a different weight, based on the number of ratings he's contributed to the overall pool, and apply that weight to each of his ratings.
Thus, the user who's made 100 ratings, and who statistically seems closer to the norm, gets more weight on each of his ratings, while the drive-by rater who was directed to the rating database by a blog, and who isn't actually part of the community, gets very little.
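One way such a per-user weight might be sketched in code; the logarithmic curve and the 50-rating full-trust threshold are assumptions for illustration, not any site's actual formula:

```python
import math

def rater_weight(num_ratings, full_trust_at=50):
    """Weight a user's ratings by how many entries he has rated.

    Grows logarithmically with participation and caps at 1.0 once
    the user has rated `full_trust_at` entries. Both the curve and
    the threshold are illustrative assumptions.
    """
    return min(1.0, math.log1p(num_ratings) / math.log1p(full_trust_at))

for n in (1, 10, 100):
    print(n, round(rater_weight(n), 2))
```

A logarithmic curve means a rater's trust grows quickly at first and then levels off, so the important distinction is between drive-by raters and community members, not between two heavy raters.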
#3: Trust Content
As Chris recently discussed, an Amazon study found that undetailed ratings showed bimodal distributions, while we've discovered that our own detailed reviews showed bell-curve distributions. This suggests another thing to trust in ratings: content.
If someone rates an item with just a number, that's likely to be less accurate than a number plus a short note, and that in turn is likely to be less accurate than a full written review. Or so we surmise from our data thus far.
As a result it makes sense to give plain ratings a lower weight and full reviews a higher weight. It's yet another easy and mechanistic way to measure who you trust.
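A sketch of that idea; the specific weight values here are hypothetical, since no actual numbers are given above:

```python
# Hypothetical content weights; the real values aren't specified above.
CONTENT_WEIGHT = {
    "rating_only": 0.5,   # a bare number
    "rating_note": 0.75,  # a number plus a short note
    "full_review": 1.0,   # a complete written review
}

def content_weight(kind):
    """Return the trust weight for a given kind of rating content."""
    return CONTENT_WEIGHT[kind]
```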
Putting It All Together
On the RPGnet Gaming Index we've put this all together to form a tree of weighted ratings that answer the question, who do you trust?
Here's how we measured each type of trust, and what we did about it. In brief: low-volume items are pulled toward the norm with a Bayesian weight; each user's ratings are weighted by how many entries he's rated; and plain ratings count for less than ratings with notes, which in turn count for less than full reviews.
These all get put together to create our final ratings for the Gaming Index: each user's individual rating for an item is multiplied by his user weight and its content weight, all of those are averaged together, and the Bayesian weight is applied to the result. The process is in no way intuitive, but users don't really need to understand the back end of a rating system. What we hope is that it's accurate, or at least more accurate than it would otherwise be given the relatively low volume of ratings we've collected thus far.
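A sketch of how such a combined calculation might look, assuming each rating arrives with a user weight and a content weight as described; all parameter values here are illustrative, not the Index's actual ones:

```python
def combined_rating(user_ratings, prior_mean=6.0, prior_weight=40):
    """Combine weighted user ratings with a Bayesian prior.

    `user_ratings` holds (score, user_weight, content_weight)
    tuples. The prior mean and prior weight are illustrative
    assumptions, not the Index's real parameters.
    """
    weighted_sum = prior_mean * prior_weight
    total_weight = prior_weight
    for score, user_w, content_w in user_ratings:
        w = user_w * content_w
        weighted_sum += score * w
        total_weight += w
    return weighted_sum / total_weight

ratings = [
    (10, 0.2, 0.5),  # drive-by rater who entered only a number
    (7, 1.0, 1.0),   # established rater with a full review
]
print(round(combined_rating(ratings), 2))
```

Note how the drive-by 10 barely registers: its combined weight of 0.1 is dwarfed by both the established rater's full weight and the prior, so the item stays near the database norm.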
Currently the calculations are a bit time-intensive. We run them once a night and they take about 10 minutes (though better data storage would dramatically speed this up).
However, being forced to run things at night like this actually introduces another benefit to our system. Because users don't get live gratification from their rating (e.g., seeing the rating bump up by .1 or even .01 when they enter their data), they're less likely to exaggerate ratings specifically to get that gratification.
Offering us one last answer to the question "who do you trust?"
We trust users who want to enter data for the good of the ratings database, rather than their own individual gain.
Now if only we could figure out a way to entirely mechanize that criterion.