**tl;dr - Amazon's Alexa skill discoverability is hindered by a poor sorting algorithm. We describe how it works, show that it's obscuring really good skills, and then propose an algorithm that would work better, as well as a way to tune it for different factors.**

## A primer on the current sorting algorithm

- The primary sort sums the number of stars, and divides by the number of reviews. Hence, a skill with three 4-star reviews ( (4+4+4) / 3 ), and a skill with one 3-star and one 5-star ( (3 + 5) / 2 ) review will both have an average rating of 4. This is calculated to at least two decimal places, possibly more. The skills with the highest scores show up first in the list.
- If two or more skills have an identical score, the secondary sort is the total number of reviews: the more, the better. So in the example above, the skill with three 4-star reviews would show before the other skill in the list, since its three reviews beat the two reviews of the other skill.
- Finally, if two skills are tied in both of the previous two categories, then the user viewing the skills will see the tied skills randomly sorted relative to each other each time they load the companion app. (As an aside, there's a caching bug in the Alexa browser site that can make certain skills in this situation disappear completely)
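To make the behavior concrete, here's a minimal Python sketch of the sort described above (the tuple key is mine, and the shuffle stands in for whatever mechanism Amazon actually uses to randomize exact ties):

```python
import random

def sort_skills(skills):
    """Sort skills the way the Alexa store appears to: average rating
    first, then review count, with remaining ties in random order."""
    # Shuffle first; Python's sort is stable, so exact ties keep
    # this random relative order.
    shuffled = random.sample(skills, k=len(skills))
    return sorted(
        shuffled,
        key=lambda s: (sum(s["ratings"]) / len(s["ratings"]), len(s["ratings"])),
        reverse=True,
    )

skills = [
    {"name": "three 4-star reviews", "ratings": [4, 4, 4]},
    {"name": "one 3-star and one 5-star", "ratings": [3, 5]},
    {"name": "single 5-star review", "ratings": [5]},
]

for s in sort_skills(skills):
    print(s["name"])
# The lone 5-star review sorts above both 4.0-average skills,
# and the three-review skill wins the 4.0 tie on review count.
```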

## So, what's the problem?

*The Wayne Investigation* and *Magic 8-Ball* should be on top with their 5-star ratings and multiple reviews. It turns out, though, that Amazon only shows stars in half-star increments. Anything above an average of 4.75 shows up as 5 stars, anything between 4.25 and 4.75 displays as 4.5 stars, and so on. If we dig deeper into the reviews for *The Wayne Investigation*, here's what we see:

The skill is __extremely__ well reviewed, and that rating holds across 19 reviews. Yet as of now, this skill is considered to be *less interesting* to the average user than a skill that has been reviewed a single time.

That single review is apparently deemed **worth more** than the (presumably) hundreds of hours Warner Bros put into this very well received game that is helping to move the state of Alexa development forward.

Worse still, the top-sorted skill has __only been reviewed by the creator of the skill itself__. Literally zero people who didn't create the game have stated their interest in it, and still it is deemed to be more valuable than *The Wayne Investigation*, *Magic 8-Ball*, or any other commonly-and-highly-rated skills.

## Fewer Batman analogies, more algorithm, please

The current algorithm surfaces skills which have **been said at least once to be exceptional**, and which have **not yet been said to be unexceptional**. This system is flawed in that skills which have not been reviewed to a level of statistical significance are the most likely to meet these criteria.

What we actually want to surface are skills that **people have consistently found to be valuable**, and that **people have not consistently found to be broken or worthless**.

To achieve this, I propose the following algorithm to calculate the canonical value of a skill based on its reviews (where a higher score is better):

- m is the mean possible score (currently, 3 stars)
- r is the rating value of an individual skill review
- modifier(x,y) is an arbitrary function that allows us to weight each review's contribution to the score in a non-linear fashion

The score is then the sum, across all of a skill's reviews, of (r - m) × modifier(r, m).

To start out, let's assume that we do want things to scale perfectly linearly, so we just define modifier(x,y) to be a constant 1, making each review contribute exactly (r - m). Scored this way, the top skills come out as:

- The Wayne Investigation - 36
- Magic 8-Ball - 19
- TIE: Utterly Body Quiz / CheerLights - 2
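Here's a quick Python sketch of the proposed score. The review distributions below are hypothetical — I've simply picked star counts consistent with the 19-review total and the scores listed above:

```python
def skill_score(ratings, m=3, modifier=lambda r, m: 1):
    """Sum each review's distance from the mean possible score,
    weighted by an arbitrary modifier function (1 = perfectly linear)."""
    return sum((r - m) * modifier(r, m) for r in ratings)

# Hypothetical distributions that happen to produce the scores above.
wayne = [5] * 17 + [4] * 2          # 19 reviews
magic_8_ball = [5] * 9 + [4] * 1    # assumed 10 reviews

print(skill_score(wayne))         # 36
print(skill_score(magic_8_ball))  # 19
```

Note that a skill whose only review is a self-given 5-star nets a score of just 2 under this scheme, well below either of the heavily reviewed skills.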

## The Modifier Function

For the modifier, I propose raising each review's distance from the mean to the power *a*, where *a* is some tunable real number. For the sake of argument, let's imagine Alexa's rating range was a bit broader and allowed for ratings from 1 star to 7 stars (with a mean possible rating of 4 stars). For possible reviews of 5, 6, or 7 stars, how would that modify each review's contribution to the overall skill's score? If we start with a = 0, we see that the output from the original formula scales exactly the same way as it did when we just assumed the output of modifier(r,m) to be 1.

At *a = 0*, the output of the modifier function is always 1; this is the linear version of the equation. If we shift *a* into positive territory, we start to see that higher ratings become exponentially better than their more median brethren.

At *a = 1*, the formula is basically following an x² pattern. At *a = 0.5*, each point above the mean is calculated as x√x. In each case, the highest possible reviews add premium value to the overall score. As you might imagine, we can achieve the inverse as well.

At *a = -0.5*, you'll note that it's still good to get higher scores, but being two points above the mean is not even twice as good as being one point above it when the final score is calculated. And you get a really interesting case at *a = -1*, where the impact of ANY positive review on the overall score is exactly 1. Concordantly, any negative review means a single-point deduction from the overall score. Any value lower than -1 puts you in weird territory where it's actually *worse* to get the highest possible review than a merely good one.
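Here's a sketch of that family of modifiers in Python, using the hypothetical 1-to-7-star range with a mean of 4. The |x − y|^a form is my reading of the patterns described above (x² at a = 1, x√x at a = 0.5, a flat ±1 at a = -1):

```python
def modifier(x, y, a):
    """Weight a review by its distance from the mean, raised to a."""
    return abs(x - y) ** a

def contribution(r, m, a):
    """A single review's contribution to the overall score."""
    return (r - m) * modifier(r, m, a)

m = 4  # mean of the hypothetical 1-to-7-star range
for a in (1, 0.5, 0, -0.5, -1):
    # Contributions of a 5-, 6-, and 7-star review at this value of a.
    row = [contribution(r, m, a) for r in (5, 6, 7)]
    print(f"a = {a:>4}: {row}")
```

At a = 1 the row grows as 1, 4, 9 (the x² pattern); at a = 0 it is the linear 1, 2, 3; at a = -1 every positive review flattens to 1.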

## So what modifier should we use?

At one extreme, you could weight ratings so heavily that only the most extreme reviews matter at all, which is what happens with *a* approaching infinity.

The question of what we "should" do really depends on which aspect we find more important. If it were up to me, I'd probably start with a linear implementation, see how things turned out, and then tune it accordingly.

## What happens to unreviewed skills?

## What are the downsides?

- No matter how well you tune a system, people will figure out how to game it. One weird benefit of the current system is that it would theoretically be easy for any casual observer to ruin the schemes of someone cheating their average score.
- It doesn't account in any way for the variability among reviewers. One interesting thing we've actually seen multiple times when researching posts about reviews is that some reviewers just will not give out full scores. The textual components of their reviews will be chock-full of effusive praise, and somehow they still deign only to grant 4 stars. This is a topic we can (and likely will) talk about at length.
- Skills are coming out fast enough now that even fixing sorting-by-review isn't going to solve all of the discoverability issues going forward. Some sort of faceted classification or search/sort/filter combination is going to be needed eventually.