On the issue of context
For people focused on usability and user-centric design (for any kind of interface), many of the best practices stem from Jakob Nielsen's seminal 10 Usability Heuristics for User Interface Design. The first heuristic in that list is "visibility of system status". To wit:
The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.
- Users of skills with multi-turn interactions will often start speaking to the device before the microphone has been opened for input, meaning only partial speech makes it to the intent-handling part of the system. That's because many users don't actually understand when they are allowed to talk and when they are not (this is known as the barge-in problem).
- If you've ever wondered why Amazon's rules around in-skill advertising are so weirdly specific about requiring that a third-party voice speak the ad, now you know - it's about context! That rule addresses the fear that users may not understand where the advertising is coming from, and will instead read it as official endorsement by Amazon rather than an ad chosen by the skill they're using.
- With the number of skills on the store now in the six figures, one of the most frustrating things we see as developers is our skills receiving poor reviews from users who actually used skills made by our competitors. This is particularly true in highly competitive genres, such as ambient sounds.
- As a way of trying to solve the also-critical issue of discoverability, Amazon at some point began randomly appending recommendations for other skills onto the end of some skill sessions. Users, uninterested in being prompted this way, regularly take it out on skill developers, despite it being something entirely controlled by the Alexa team and for which we have no opt-out.
As a skill developer today, you have to assume that a significant portion of your customers don't even know that your skill was not something Amazon built, and that when they're using your skill they expect that they can seamlessly context shift and ask about anything they could normally ask about from the top level.
A bit of mitigation
Amazon itself did something super clever when it launched the Echo, which is that it took a simple LED ring and found a way to convey a bunch of context with it. While many users still haven't learned the nuances, most of us understand that a spinning ring means one thing, a solid light another, and no light something else again. While this isn't really a solution to context in an audio medium (as it's relying on a visual indicator), it was still an extremely important design choice. Many Alexa-enabled devices today have screens, but even so you still see design history going back to that initial implementation.
As competition for sleep sounds skills increased, it became clear that we needed to help Alexa customers identify genuine Invoked Apps experiences - and give assurance to the millions of customers who had already fallen in love with our apps since 2016. To do this, we created sonic branding through the Invoked Apps Earcon (a soft chime heard when launching an Invoked Apps voice app).
Enough context about context
- Often, users will not realize that they are still within my skill.
- Alternately, some users will know that they are in my skill but not realize that they can't do non-skill stuff from within that context.
- Some users don't have a good grasp of how they are supposed to transition from a skill back to Alexa's top level within a given session.
- As a result, I would often have users repeatedly slamming the Fallback intent (and presumably having very poor experiences) by saying things that Alexa can handle at the top level but that my skill cannot handle.
- Because the Fallback intent does not tell you in real time what the utterance was, I had no way of knowing which users were stuck in this befuddled state, and which users were just poking around or looking for the various easter eggs that live in my skill's voice interface.
The thing that triggered my final solution though was remembering back to our first certification pass for InsultiBot (in 2015!), where we had originally included an easter egg in our voice model around asking for the weather. One of our testers had asked it on a whim, and we thought it would be funny if InsultiBot knew how to respond to that question, despite it not being part of the core competency. We ended up pulling it out at the last minute to reduce VUI clutter and because we had no way of measuring false positives or false negatives in the ASK at the time. The connection I made years later, though, was that the "easter egg" was functionally a highly-targeted Fallback. And once we had the ability to see what requests were coming in and getting mapped to specific intents, it became easy to throw easter-egg type stuff back into the voice model.
So, the next time I scoured my intent history for future model tweaks, I kept a special watch for repeated patterns in the Fallback intent. What I found was that there were a few requests that came up over and over again, and in fact one of them was requests for weather, just like way back in InsultiBot v1! There were also four other common utterance groups: setting a timer, asking for a joke, asking to sing a song, and trying to launch another skill. I combined joke and song into a generic "performance" bucket, and I deferred the potentially more tricky "launch" requests for later, leaving me with three fallback types that I'd build specific content for - weather, timers, and performance.
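That pattern mining doesn't need anything fancy; a minimal sketch of bucketing Fallback utterances by keyword might look like the following (the sample utterances, bucket names, and keywords are all illustrative - the real data would come out of your intent history export):

```python
from collections import Counter

# Hypothetical export of utterances that landed in AMAZON.FallbackIntent,
# e.g. copied out of the intent history in the developer console.
fallback_utterances = [
    "what's the weather",
    "what is the weather like",
    "set a timer for five minutes",
    "tell me a joke",
    "sing me a song",
    "open sleep sounds",
    "what's the weather today",
    "set a timer",
]

# Crude keyword buckets matching the patterns described above; these
# names and keywords are my own, not anything from an Amazon API.
BUCKETS = {
    "weather": ("weather",),
    "timer": ("timer",),
    "performance": ("joke", "song", "sing"),
    "launch": ("open", "launch", "start"),
}

def bucket_for(utterance):
    """Return the first bucket whose keywords appear in the utterance."""
    text = utterance.lower()
    for bucket, keywords in BUCKETS.items():
        if any(k in text for k in keywords):
            return bucket
    return "other"

counts = Counter(bucket_for(u) for u in fallback_utterances)
print(counts.most_common())
```

On this toy data, weather surfaces as the biggest bucket, mirroring what I saw in the real history.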
- I knew I had to have a bit that just outright explained to the user that they were still in my skill's context.
- I had to explain to them the way out, in case they were stuck (I'd never make it as a Vegas casino designer).
- Now that I knew exactly what they were asking for, I could still fulfill the skill's core competency of providing a fun, on topic quip.
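Putting those three requirements together, each bucket becomes its own intent with a three-part response. Here's a minimal, SDK-free sketch of that response logic (the intent names, quips, and exit phrasing are my own illustration, not anything prescribed by the ASK):

```python
# Hypothetical intent names for the three enhanced fallback buckets,
# each mapped to an on-brand quip.
QUIPS = {
    "WeatherFallbackIntent": "It's always sunny in here.",
    "TimerFallbackIntent": "I'd count the seconds, but I'd lose interest.",
    "PerformanceFallbackIntent": "My singing voice is still in beta.",
}

def enhanced_fallback_response(intent_name, skill_name="InsultiBot"):
    """Build the three-part response: an on-topic quip, a reminder
    that the user is still inside the skill, and the way out."""
    quip = QUIPS.get(intent_name)
    if quip is None:
        # Anything unrecognized still falls through to the generic Fallback.
        return f"Sorry, {skill_name} can't help with that."
    return (
        f"{quip} "
        f"By the way, you're still talking to {skill_name}. "
        f"Alexa can handle that request at the top level. "
        f"Just say 'exit' first, then ask again."
    )

print(enhanced_fallback_response("WeatherFallbackIntent"))
```

In a real skill each of these would be its own intent handler, with a handful of sample utterances lifted straight from the Fallback history.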
Then in the second half of last year the solution was handed to me on a silver platter: I ended up in the NLU Bulk Evaluation Tool beta, which allows large-scale, potentially combinatorial testing of your voice model. This meant I could put my background in test to work, actually validating that adding these fun new intents wasn't going to accidentally draw users away from the intents they meant to invoke.
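The combinatorial part can be as simple as crossing carrier phrases with request topics to generate (utterance, expected intent) pairs to feed into an evaluation run; the field names below are my own sketch, not the tool's actual schema:

```python
from itertools import product

# Carrier phrases crossed with topics; all values are illustrative.
CARRIERS = ["what's", "tell me", "can you tell me"]
TOPICS = {
    "the weather": "WeatherFallbackIntent",
    "a joke": "PerformanceFallbackIntent",
}

# Every carrier x topic combination, paired with the intent we expect
# the voice model to resolve it to.
annotation_set = [
    {"utterance": f"{carrier} {topic}", "expected_intent": intent}
    for carrier, (topic, intent) in product(CARRIERS, TOPICS.items())
]

for row in annotation_set:
    print(row["utterance"], "->", row["expected_intent"])
```

Even a few dozen carriers and topics multiply out to enough coverage to catch a new intent hijacking utterances that used to map somewhere else.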
Still, though, I was paranoid that I was going to break things, so when I initially released it I rolled it out only on InsultiBot (the better-content-but-lower-traffic of the CompliBot/InsultiBot dyad).
Turns out I had nothing to be worried about, and it's been hugely successful. I almost never have weather/timer/performance requests making it into the Fallback anymore, and I have yet to see a single false positive hit one of my three EnhancedFallback intents. A normal intent history filtering on those looks more or less like this:
Taking it further
For my part, I mentioned that there was a fourth bucket that I had ignored, which is people making "play" requests. I skipped that one because I didn't want to deal with trying to solve the disambiguation problem between "game" and "song" that Amazon has been dealing with since the very beginning. But it seems like there's a subset of those requests I could probably pull into a fourth enhanced fallback bucket - those that use the "launch" phrase instead. Beyond that, I think this feature has shown itself to be valuable enough that I should backport it to other skills where it could be valuable.
I'm a little surprised that Amazon hasn't really made a move to try to solve this in a more common way across all skills. There are a variety of levels at which they could insert themselves, from just providing a built-in intent so we don't need to manage lists of weather utterances, all the way up to finding a way to make all features available via a default backstop even while in a skill context. I suspect that we'll see something in this area this year, but I still think it's well worth the time for skill devs to implement solutions in the meantime.
There's a second piece here, which is that part of the reason users don't understand the 1P-feature vs. 3P-skill dichotomy is how Amazon has chosen to brand and market the ecosystem. They seem caught in a sort of perpetual cognitive dissonance between wanting to train users about the great things that skill builders are doing and wanting to provide customers an experience that "just works" - and these two motivators are often at direct odds. Actively promoting features and marketing that make clear the distinction between functions provided by Amazon, Skill Builder A, and Skill Builder B could certainly help orient users, but at a significant cost to other aspects of usability.
Finally, one thing that I've seriously considered is actually trying to solve some of these requests on behalf of my users rather than just giving them a funny response. Amazon has very strict rules about skills acting as "marketplaces", but at the very least the idea of using skill connections to let Big Sky handle the weather request makes sense. If that feature provided a true hand-off capability, instead of the (largely panned) partial measures we currently have, I'd probably actually do it (and then naturally charge Steve Arkonovich for all that traffic I'm sending his way).
So what do you guys think? I'm sure I'm not the first one to come up with this idea - I'd love to see how others are approaching it.