We at DERP Group have had plenty to say about the problems with Amazon's current system of reviews for Alexa Skills, the effect this has on developers, and some of the shady practices currently surrounding skill reviews. In this post we're specifically concerned with the way the Average Rating sort option in the Alexa companion apps (and alexa.amazon.com) does not effectively bubble the best skills to the top. For a change of pace, though, instead of just calling out the problem, we figured we'd try proposing a solution as well. Click through to read more...

**tl;dr - Amazon's Alexa skill discoverability is hindered by a poor sorting algorithm. We describe how it works, show that it's obscuring really good skills, and then propose an algorithm that would work better, as well as a way to tune it for different factors.**

So, as we've mentioned a couple times previously, the current skill sorting options provided by Amazon are fairly inadequate. You can sort alphabetically, you can sort in reverse chronological order of release date, and you can sort by average rating. This last option should theoretically be the primary option people use, since "best average rating" is synonymous with "best quality" in most people's minds. Unfortunately, the sorting algorithm lacks the necessary nuance, and so good skills get hidden and unproven skills get promoted.

## A primer on the current sorting algorithm

If you're already familiar with Amazon's rating system and average review sorting algorithm, go ahead and jump to the next section. For everyone else, here's the deal. Skills are rated on a 1-to-5 star scale, with 5 being the best. Ratings are tied to a textual review, and can only be granted by owners of an Alexa-enabled device.

When sorting by average review, the following steps occur:

- The primary sort sums the number of stars and divides by the number of reviews. Hence, a skill with three 4-star reviews ((4 + 4 + 4) / 3) and a skill with one 3-star and one 5-star review ((3 + 5) / 2) will both have an average rating of 4. This is calculated to at least two decimal places, possibly more. The skills with the highest scores show up first in the list.
- If two or more skills have an identical score, the secondary sort is calculated on total number of reviews. The more the better. So in the example above, the skill with three 4-star reviews would show before the other skill in the list, since its three reviews beat the two reviews of the other skill.
- Finally, if two skills are tied in both of the previous two categories, then the user viewing the skills will see the tied skills randomly sorted relative to each other each time they load the companion app. (As an aside, there's a caching bug in the Alexa browser site that can make certain skills in this situation disappear completely)
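As a sketch, the three steps above amount to something like the following. The skill data here is hypothetical, and this is just one plausible reading of the behavior, not Amazon's actual code:

```python
import random

def current_sort(skills):
    """Sort skills the way Amazon's average-rating sort appears to work.

    Each skill is a (name, ratings) pair, where ratings is a list of
    1-5 star integers. Primary key: mean rating (descending); secondary
    key: review count (descending); remaining ties: random order.
    """
    def mean(ratings):
        return sum(ratings) / len(ratings) if ratings else 0.0

    # Shuffle first so that skills tied on both keys land in a random
    # relative order; Python's sort is stable, so unique keys still
    # sort deterministically.
    shuffled = random.sample(skills, len(skills))
    return sorted(shuffled, key=lambda s: (mean(s[1]), len(s[1])), reverse=True)

skills = [
    ("Three 4-star reviews", [4, 4, 4]),
    ("One 3-star, one 5-star", [3, 5]),
    ("Single 5-star review", [5]),
]
print([name for name, _ in current_sort(skills)])
# The single 5-star skill sorts first on its 5.0 average; the two
# skills averaging 4.0 are then ordered by review count.
```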

## So, what's the problem?

On the surface, this seems fairly reasonable. If your skill is well reviewed, it will likely stay near the front of the listings. Well, the problem is this:

This is a classic example of how the algorithm fails the user base. We have here a screenshot of four skills, taken from the first page of skills. The top two skills are tied for #4 overall with three other identically rated (5-star, 1 review) skills, but to save space let's only look at these two.

At this point, you may be thinking to yourself that I'm a liar, and that the algorithm isn't what I said, because *The Wayne Investigation* and *Magic 8-Ball* should be on top with their 5-star ratings and multiple reviews. It turns out, though, that Amazon only shows stars in half-star increments. Anything above an average of 4.75 shows up as 5 stars, anything between 4.25 and 4.75 displays as 4-and-a-half stars, and so on. If we dig deeper into the reviews for *The Wayne Investigation*, we see that this skill has a 4.9 out of 5.0 average rating - which is to say it is __extremely__ well reviewed. This rating exists over 19 reviews of the skill. Yet as of now, this skill is considered to be *less interesting* to the average user than a skill that has been reviewed a single time.

To add insult to injury, this is a fully-realized adventure game, built by a development team at Warner Bros, and it is losing to several skills that were almost certainly made as part of the "1-Hour Skill" tutorial Amazon put up a couple of months ago. Amazon's sorting algorithm has made a qualitative judgement that the result of someone copying and pasting code for less than an hour is **worth more** than the (presumably) hundreds of hours Warner Bros put into this very well received game that is helping to move the state of Alexa development forward.

The real "cherry-on-top" of this whole thing? One of the (unpictured) skills that has a single 5-star review has __only been reviewed by the creator of the skill itself__. Literally zero people who didn't create the game have stated their interest in it, and still it is deemed to be more valuable than *The Wayne Investigation*, *Magic 8-Ball*, or any other commonly-and-highly-rated skill.

This sorting algorithm is just like the corrupted city of Gotham - there can be no justice with a system so broken. And much like the Wayne murders, vigilantism may be the only path to redemption. In this context, I speak of the vigilantism of random developers (read: me) proposing new algorithms.
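Incidentally, the half-star display rounding described above can be sketched in a couple of lines. The thresholds are the ones quoted earlier; the exact boundary behavior is a guess, not Amazon's actual code:

```python
def displayed_stars(average):
    """Round an average rating to the nearest half star for display.

    Per the thresholds above: anything above 4.75 shows as 5 stars,
    anything between 4.25 and 4.75 shows as 4.5 stars, and so on.
    """
    return round(average * 2) / 2

print(displayed_stars(4.9))   # 5.0 - how a 4.9 average shows as "5 stars"
print(displayed_stars(4.5))   # 4.5
print(displayed_stars(4.2))   # 4.0
```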

## Less Batman analogies, more algorithm, please

Alright, so before trying to come up with a better algorithm, I think it's useful to describe in plain speech what should be different about the way skills are sorted.

At present, the leading skills are those which have **been said at least once to be exceptional**, and which have **not yet been said to be unexceptional**. This system is flawed in that skills which have not been reviewed to a level of statistical significance are the most likely to meet these criteria.

An alternate approach would be to promote skills that **people have consistently found to be valuable**, and that **people have not consistently found to be broken or worthless**.

To achieve this, I propose the following algorithm to calculate the canonical value of a skill based on its reviews (where a higher score is better):

score = Σ (r − m) × modifier(r, m), summed over every review of the skill

Where:


- m is the mean possible score (currently, 3 stars)
- r is the rating value of an individual skill review
- modifier(x,y) is an arbitrary function that allows us to weight each review's contribution to the score in a non-linear fashion

To start out, let's assume that we do want things to scale perfectly linearly, so we just define modifier(x,y) to be:

modifier(x, y) = 1

So, in this example, a 3-star review has no effect on the overall rating of the skill, a 4-star review adds 1 point to the total, and a 5-star review adds 2 points to the overall score. If we were to apply this algorithm to the skills in the pictures above, we'd end up with the following aggregate scores (and thus, ordering):

- The Wayne Investigation - 36
- Magic 8-Ball - 19
- TIE: Utterly Body Quiz / CheerLights - 2
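For illustration, here's a minimal sketch of the linear scoring rule. The review distributions below are assumptions chosen to be consistent with the averages and totals above (e.g., 17 five-star plus 2 four-star reviews average ~4.9 over 19 reviews), not the actual review breakdowns:

```python
def skill_score(ratings, m=3):
    """Linear version of the proposed score: sum each review's
    distance from the mean possible rating m (modifier == 1)."""
    return sum(r - m for r in ratings)

# Hypothetical distributions consistent with the totals above:
print(skill_score([5] * 17 + [4] * 2))  # 36 - The Wayne Investigation
print(skill_score([5] * 9 + [4] * 1))   # 19 - one way to reach Magic 8-Ball's total
print(skill_score([5]))                 # 2  - a single 5-star review
```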

## The Modifier Function

Maybe we think that 1-star and 5-star reviews should still be extra special - we can weight them as such. Or maybe we believe the opposite is true, and think that the results are already too polarized, so a 5-star shouldn't be twice as valuable as a 4-star review - we can handle that easily too.

There are a bunch of ways we could approach this function - we could hardcode modifiers for each possible input, we could use some sort of logarithmic scale, etc. My first crack at it is this:

modifier(x, y) = |x − y|^a

In this case, we subtract the second value from the first, find the absolute value, and then take that value to the power of *a*, where *a* is some tunable real number. For the sake of argument, let's imagine Alexa's rating range was a bit broader and allowed for ratings from 1 star to 7 stars (with a mean possible rating of 4 stars). For possible reviews of 5, 6, or 7 stars, how would that modify each review's contribution to the overall skill's score?

If we start with *a = 0*, we see that the output from the original formula scales exactly the same way as it did when we just assumed the output of modifier(r,m) to be 1. This is expected, since with *a = 0*, the output of the modifier function IS always 1. This is the linear version of the equation.

If we shift *a* into positive territory, we start to see that higher ratings become exponentially better than their more median brethren. When *a = 1*, the formula is basically following an x² pattern. At *a = .5*, each point above the mean is calculated as x√x. In each case, the highest possible reviews add premium value to the overall score.

As you might imagine, we can achieve the inverse as well. If you set *a = -.5*, you'll note that it's still good to get higher scores, but being two points above the mean is not even twice as good as being one point above it when the final score is calculated. And you get a really interesting case at *a = -1*, where the impact of ANY positive review on the overall score is just going to equal 1. Concordantly, any negative review would mean a single point deduction from the overall score. Any value lower than -1 and you'd end up in a weird territory where it's actually *worse* to get the highest possible review than a slightly good one - weird.

## So what modifier should we use?

With the modifier function as written, a value between negative one and zero is a vote in favor of stronger weighting for the number of overall reviews. A value greater than zero will lend more credence to the polar reviews, both 1-star and 5-star. In fact, aside from a few specific boundary cases, you could almost consider the current system to be an implementation of this algorithm with a value of *a* approaching infinity.

The question of what we "should" do really depends on which aspect we find more important. If it were up to me, I'd probably start with a linear implementation, see how things turned out, and then tune it accordingly.
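To make the tuning concrete, here's a sketch of a single review's modifier-weighted contribution for several candidate values of *a*, using the hypothetical 1-to-7-star scale (mean of 4) from the previous section:

```python
def contribution(r, m, a):
    """Signed contribution of one review: (r - m) * |r - m|^a.

    r is the review's star rating, m the mean possible rating, and
    a the tunable exponent from the modifier function.
    """
    diff = r - m
    if diff == 0:
        return 0.0  # a review at the mean never moves the score
    return diff * abs(diff) ** a

# Contributions of 5-, 6-, and 7-star reviews for several values of a:
for a in (1, 0.5, 0, -0.5, -1):
    print(f"a={a}:", [round(contribution(r, 4, a), 2) for r in (5, 6, 7)])
```

At *a = 1* the contributions follow the x² pattern (1, 4, 9); at *a = 0* they stay linear (1, 2, 3); and at *a = -1* every positive review is worth exactly 1 point no matter how many stars it carries.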

## What happens to unreviewed skills?

We haven't talked about where unreviewed skills fall into this whole mess, yet. Currently, they sit at the very end of the list, behind even the 1-star skills. A side effect of our proposed algorithm is that the relative ranking of unreviewed skills changes as well.

The reason these skills fall where they do at present is that, in Alexa's eyes, they have zero review stars to divide, and (avoiding the whole zero-divided-by-zero mess) they end up with what is basically an impossibly low review score of zero, even worse than the worst-reviewed skills.

In the new formula, this doesn't actually change - unreviewed skills are still scored at zero (although, without any chance of division by zero!). What we've done, though, is essentially shift the goalposts so that zero is no longer the worst score - it's the mean score. Since it's possible for other skills to go into the negative, you'd now start in a perfectly neutral middle ground when your skill is first certified. This feels to me like the correct behavior - a skill that only has bad reviews is probably less interesting than one that is a complete unknown quantity.
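As a quick sanity check of that behavior, here's the linear version of the proposed score applied to an unreviewed skill, a badly reviewed one, and a well-reviewed one (hypothetical ratings):

```python
def skill_score(ratings, m=3):
    """Linear version of the proposed score: reviews below the mean
    subtract points, reviews above the mean add them."""
    return sum(r - m for r in ratings)

print(skill_score([]))       # 0  - an unreviewed skill sits at the neutral mean
print(skill_score([1, 2]))   # -3 - consistently bad reviews push below zero
print(skill_score([5, 4]))   # 3  - good reviews push above zero
```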

## What are the downsides?

So, this system is obviously great (and you can all thank me later for it), but by no means is it perfect. Here are a few things to consider:

- No matter how well you tune a system, people will figure out how to game it. One weird benefit of the current system is that it would theoretically be easy for any casual observer to ruin the schemes of someone cheating their average score.
- It doesn't account in any way for the variability among reviewers. One interesting thing we've actually seen multiple times when researching posts about reviews is that some reviewers just will not give out full scores. The textual components of their reviews will be chock-full of effusive praise, and somehow they still deign only to grant 4 stars. This is a topic we can (and likely will) talk about at length.
- Skills are coming out fast enough now that even fixing sorting-by-review isn't going to solve all of the discoverability issues going forward. Some sort of faceted classification or search/sort/filter combination is going to be needed eventually.

## Thoughts?

So, what do you guys think? Should we start harassing the Alexa team to implement it? Think you could come up with a better system without getting super complicated? Just want to hate on mine because you're a troll? Let us know in the comments, or hit us up directly.