- The Feature
- The Problem
- Technical Details
- Top-Level Invocations vs Skill Invocations
- Mitigation Missed
- Testing
The Feature
The feature itself actually has two different interfaces - there's the standard ability to launch it with an invocation name ('Alexa, tell Xbox to record that'), but it also has a smart home interface. That in and of itself is interesting, as developers have long been looking for a way to tie our brands' skills together under a single skill entry when there are multiple types of interfaces (think: a news agency's custom skill that also has a flash briefing). It seems Microsoft got special treatment, or access to an unannounced new feature, in that regard.
If you check out the sample utterances they provide, they look like this:
The Problem
Technical Details
From what I can tell, the misrouting happens when two conditions are met:
- The device triggering the interaction is a display device (in my case, a first-gen Echo Show).
- There is some sort of media service (often FireTV) tied to the same account the request came from.
With a bit of digging, it turns out that this is not technically a new problem; rather, it's something that existed previously and is now being exacerbated by the Xbox addition. And that's where we come back to that final line in the Xbox utterances: the launch phrase. By adding the ability to say "Alexa, launch <name of Xbox game>" rather than "Alexa, ask Xbox to launch <name of Xbox game>", Amazon has created an ambiguous situation.
Early on, "launch" and "open" were exclusively reserved for launching skills, but at some point that stopped being true. Previously, the most common case where this would occur was something like "Alexa, play <name of an Alexa game>", where the "play" trigger was ambiguous between requests for music (or videos on the Show) and skills. Generally speaking, media content always wins in that case, and skill developers learned to coach our users to not use the word "play". Ever. Now, the Alexa media catalog has been greatly expanded with all of Microsoft's offerings, and the words "Launch" and "Open" have been commandeered. That's not good for developers/users of skills.
Top-Level Invocations vs Skill Invocations
Among the development community, we have a term, "Top-Level Invocations" (sometimes "name-free invocations" or "super editorials"), which refers to utterances that let you invoke a skill without saying its name. Rather than "Alexa, ask CompliBot to compliment me", you can just say "Alexa, compliment me" and sometimes have CompliBot serve up a session, for example.
And really, these TLIs represent the ideal of Alexa: the idea that you can just ask for something, without namespacing it, and get what you want. To wit, at no point in Star Trek did you ever hear Picard say "Computer, tell Unofficial Starfleet Skill to engage self-destruct".
Plus, we've all had users tell us, straight-up, that they don't use our skills because it's too hard to remember the names. But solving that is a non-trivial problem. It's not like we can just take everyone's interaction models and cram them together at the top level - there are some intents implemented by every single skill, so it would become a disambiguation nightmare.
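To make the collision concrete, here's a toy sketch - every skill name, intent, and utterance below is hypothetical, not pulled from any real interaction model - of what happens if you naively merge models at the top level: two unrelated skills can both legitimately claim the exact same utterance.

```typescript
// Hypothetical, simplified interaction-model fragments for two unrelated skills.
// Both claim the bare utterance "what's the score", so a naive top-level merge
// has no way of knowing which skill the user meant.
interface IntentDef {
  name: string;
  samples: string[];
}

const sportsSkillModel: IntentDef[] = [
  { name: "GetScoreIntent", samples: ["what's the score", "how did my team do"] },
];

const triviaGameModel: IntentDef[] = [
  { name: "GetScoreIntent", samples: ["what's the score", "how many points do I have"] },
];

// Naively merge every sample utterance from every skill into one top-level table.
const topLevel = new Map<string, string[]>();
const allSkills: Array<[string, IntentDef[]]> = [
  ["Hypothetical Sports Skill", sportsSkillModel],
  ["Hypothetical Trivia Game", triviaGameModel],
];
for (const [skillName, model] of allSkills) {
  for (const intent of model) {
    for (const sample of intent.samples) {
      topLevel.set(sample, [...(topLevel.get(sample) ?? []), skillName]);
    }
  }
}

// The same utterance now points at two skills - the disambiguation nightmare.
console.log(topLevel.get("what's the score"));
// => [ 'Hypothetical Sports Skill', 'Hypothetical Trivia Game' ]
```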
Amazon is making progress, however. Expanding their normal first-party features with TLIs managed by their marketing team (like the compliment one above) was one of the first steps. FlashBriefings were another way to serve a specific piece of this puzzle - making it so users didn't need to query multiple news or podcast skills sequentially anymore. More recently, they've opened up CanFulfillIntentRequest as a way to add more things to the top level.
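For anyone who hasn't implemented it: CanFulfill is essentially the skill telling Alexa, per request, whether it could plausibly handle a name-free utterance. Here's a minimal sketch of the response, as I understand the documented shape - the intent and slot names (NextBusIntent, StopNumber) are stand-ins I made up, not from any real skill.

```typescript
// Sketch of building a CanFulfillIntentRequest response.
// The skill only claims utterances it actually understands; everything else
// gets a "NO" so Alexa can route the request elsewhere.
interface CanFulfillSlot {
  canUnderstand: "YES" | "NO" | "MAYBE";
  canFulfill: "YES" | "NO";
}

interface CanFulfillResponse {
  version: string;
  response: {
    canFulfillIntent: {
      canFulfill: "YES" | "NO" | "MAYBE";
      slots: Record<string, CanFulfillSlot>;
    };
  };
}

function handleCanFulfill(
  intentName: string,
  slots: Record<string, string>
): CanFulfillResponse {
  // Hypothetical check: we can fulfill a transit request if it names a stop.
  const canHandle = intentName === "NextBusIntent" && "StopNumber" in slots;

  return {
    version: "1.0",
    response: {
      canFulfillIntent: {
        canFulfill: canHandle ? "YES" : "NO",
        slots: canHandle
          ? { StopNumber: { canUnderstand: "YES", canFulfill: "YES" } }
          : {},
      },
    },
  };
}
```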
But every time they add these features, they are creating additional work for developers, whether that's fighting for marketing slots, building FlashBriefing versions of their skills, or implementing CanFulfill. Maybe more importantly, every time they add a new request or type of request to the top level, they're increasing the complexity of the overall system and the odds of unexpected collisions. It's definitely a rock-and-a-hard-place type situation.
Additionally, each of these solutions creates several independent problems of its own, often involving a chicken-and-egg dilemma: a new skill builder can never garner the attention necessary to take advantage of these features, precisely because they don't have that attention to begin with. But that's a topic for a whole series of other posts...
Mitigation Missed
But luckily, we have other ways of intuiting what the user might have wanted. A big push among the Alexa team lately has been to better understand and utilize context: "What was the user doing when they made their utterance?", or "What else do we know about the user that might help us make a better choice?". This is likely where the "is it a display device" and "do they have a FireTV" aspects come into play - the failure seems to be that these signals were misinterpreted and then weighed far too heavily. Unfortunately, they also either completely missed, or didn't properly weigh, what I believe to be the most important signal of all: the historical success of the mapping to the existing skill.
As I mentioned, One Bus Away is one of the more tenured skills in the store. I've been using it five days a week for two years. All of that historical data should add up to tell Amazon that the mapping of the invocation "open one bus away" to the concept of launching the skill named "One Bus Away" was working really well. If it had been acting incorrectly, it's reasonable to assume I would have stopped using that utterance rather than continuing to invoke it on a regular cadence. Now, this isn't always going to be the case. Matt Kruse's brand-new game will have no history of usage, but that doesn't mean "Need for Speed" is a better match for "open speed tap" than his game, literally named "Speed Tap". But in the case of skills like One Bus Away or Big Sky that lend themselves to consistent usage, it's inexcusable to have a new Alexa feature interfere and siphon away committed users who are calling a skill by its name.
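I obviously have no visibility into how Amazon actually scores these signals, so treat this as a purely illustrative sketch of the argument - the weights and numbers are invented - but the fix amounts to making historical success a first-class signal instead of letting device context dominate:

```typescript
// Purely illustrative - a made-up scoring model, not Amazon's actual routing logic.
// The point: two years of successful "open one bus away" launches should outweigh
// circumstantial context like "this is a display device with a FireTV on the account".
interface Candidate {
  target: string;            // a skill, or a media catalog entry
  contextScore: number;      // device type, linked services, etc. (0..1)
  historicalSuccess: number; // how often this exact mapping has worked before (0..1)
}

function pickTarget(candidates: Candidate[]): Candidate {
  // Hypothetical weights: history dominates context, not the other way around.
  const score = (c: Candidate) => 0.8 * c.historicalSuccess + 0.2 * c.contextScore;
  return candidates.reduce((best, c) => (score(c) > score(best) ? c : best));
}

const winner = pickTarget([
  { target: "One Bus Away (skill)", contextScore: 0.2, historicalSuccess: 0.95 },
  { target: "Xbox / media catalog entry", contextScore: 0.9, historicalSuccess: 0.0 },
]);
console.log(winner.target); // => "One Bus Away (skill)"
```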
Finally, I hold a slightly more controversial opinion on the matter: Amazon is creeping closer to, but has not yet reached, the point of being fully committed to Alexa Skills as the primary feature of their platform - they're still hedging. The crux of the argument is something like this: "If you are running a platform, and you can choose to promote quality content that is native to your platform or quality content that is imported from outside of your platform, which do you choose?"
In the early stages, the answer is almost always the latter - you want people to care about your platform, so you entice them with things they are already familiar with. Many people associate Alexa's first real "win" with the 2016 Super Bowl marketing push where they dropped the Domino's and Uber skills, both of which existed on numerous platforms outside of Alexa, and neither of which did a particularly good job of showing what was so great about Alexa. At some point, though, you have to pivot and start focusing on what you can provide that others cannot (see: Netflix). Otherwise the most you can hope to achieve is parity, not primacy.
In the case of Alexa, that is skills. Until the day that voice-first companies and Alexa-native skills are the headline, and all of the other features are the subtext, we (skill devs) haven't "made it" yet. It's up to Amazon to determine when they're ready to shift the conversation in that way, because they're the only ones who can do it.
There's one last facet of this whole thing, and it's my standard rant: testing...
Testing (yes, again...)
And a few bullet points:
- This sort of thing has happened before (Jan. 2018 comes to mind), and it'll happen again.
- Testing through the CLI would not have caught any of this.
- The fact that we're getting intermittent reports of issues from devs implies that most people don't know their skills are broken.
IT'S TIME, Amazon. Full testing. Audio clips, through the front door, including context (like device type).