
A More Graceful Fallback

3/18/2020

Towards the end of last year, I implemented a feature in a few of my skills that was meant to chip away at one small corner of one of the biggest problems in the VUI space: conveying application context. It was an idea I had been toying with for quite a while, but a few factors made the time right to implement it, and I'm glad to say it's been hugely successful! Read on to hear about what I built, and why...


On the issue of context

Before I jump into explaining what I did, I want to talk about the bigger problem space. As the Alexa interface has grown, both in terms of first party features and third party skills, it has become increasingly apparent that we don't have a solution yet for conveying a user's context to them through an audio medium. What I mean by this is that a user interacting with Alexa often doesn't know which part of the system they are talking to, what information that feature has about them, or what options are available to them at that given time.

For people focused on usability and user-centric design (for any interface), many of the best practices stem from Jakob Nielsen's seminal 10 Usability Heuristics for User Interface Design. The first heuristic on that list is "visibility of system status". To wit:
The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.
The issue of why it's hard to follow this heuristic could fill a whole series of blog posts by itself, but suffice it to say, it's not a solved problem. This problem rears its head in a number of frustrating ways. A few very common places where we see this:
  1. Users of skills with multiturn interactions will often start speaking to the device before the microphone has been opened up to them for input, resulting in only partial speech making it to the intent handling part of the system. This is because many users don't actually understand when they are allowed to talk and when they are not. (This is known as the barge-in problem).
  2. If you've ever wondered why Amazon's rules around in-skill advertising are so weirdly specific about requiring that third-party voices speak the ad, now you know - it's about context! That rule addresses the fear that users may not understand where the advertising is coming from, and instead read it as an official endorsement by Amazon rather than an ad chosen by the skill they're using.
  3. With skills on the store now numbering in the six figures, one of the most frustrating things developers see is our skills receiving poor reviews from users who actually used skills made by our competitors. This is particularly true in highly competitive genres, such as ambient sounds.
  4. As a way of trying to solve the also-critical issue of discoverability, Amazon at some point began randomly appending recommendations for other skills onto the end of some skill sessions. Users, uninterested in being prompted this way, regularly take it out on skill developers, despite the behavior being entirely controlled by the Alexa team, with no opt-out for us.

As a skill developer today, you have to assume that a significant portion of your customers don't even know that your skill wasn't built by Amazon, and that while using it they expect to be able to seamlessly shift context and ask about anything they could normally ask at the top level.

A bit of mitigation

This is a hard problem to solve, especially in a space where we can't just fall back on visual affordances or decades of design language, but not all hope is lost. Just as an overview of the problem space could warrant thousands of words, the solutions our voice-first community is pursuing are myriad and varied. A few highlights I'd like to call out:

Amazon itself did something super clever when it launched the Echo: it took a simple LED ring and found a way to convey a surprising amount of context with it. While many users still haven't learned the nuances, most of us understand that the spinning ring means one thing, the solid light means another, and no light at all means something too. While this isn't really a solution to context in an audio medium (it relies on a visual indicator), it was still an extremely important design choice. Many Alexa-enabled devices today have screens, but even so you can still see design cues tracing back to that initial implementation.
In the absence of any visual indicator, though, we're forced to use what's available to us, and a few skill builders have taken to heart the challenge of disambiguation by providing audio that is unique and unambiguously their own. An early and high profile example of this came about as a way of addressing the problems stemming from near-identical skills described above. Nick Schwab explains:
As competition for sleep sounds skills increased, it became clear that we needed to help Alexa customers identify genuine Invoked Apps experiences - and give assurance to the millions of customers who had already fallen in love with our apps since 2016. To do this, we created sonic branding through the Invoked Apps Earcon (a soft chime heard when launching an Invoked Apps voice app).
Taking the idea of unique audio branding even further, a burgeoning area of exploration is around using much more customized text-to-speech solutions. If a skill is willing to go to the extent of forgoing all of Amazon's neat TTS features, it can use a fully synthesized voice and be positive that a user will be aware of any context shift. Scaling back, though, even something like using a Polly voice or just the available voice emotion tags can provide enough of a shift that the user is aware of the change.
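To make that concrete, here's a minimal sketch of the kind of SSML shift being described, using a Polly voice and one of the standard Alexa emotion tags. The wording, the voice choice, and the use of the ASK SDK for Node.js are illustrative assumptions, not a prescription for any particular skill.

```typescript
// Minimal sketch: wrap part of a response in a distinct Polly voice and an
// Alexa emotion tag so the listener hears an audible "context shift".
// The text and voice choice here are purely illustrative.
const ssml =
  '<voice name="Matthew">You are still inside the skill, by the way.</voice> ' +
  '<amazon:emotion name="excited" intensity="medium">' +
  "Here's your daily dose of content." +
  '</amazon:emotion>';

// With the ASK SDK for Node.js, the string goes into the response as usual
// (the SDK adds the outer <speak> wrapper):
// return handlerInput.responseBuilder.speak(ssml).getResponse();
```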

Enough context about context

Alright, so that's all a huge problem space, and I definitely don't have solutions to most of those problems. The problem I was trying to ease for my users was a very specific one. The facets of it are:
  • Often, users will not realize that they are still within my skill;
  • Alternately, some users will know that they are in my skill but not realize that they can't do non-skill stuff from within that context;
  • Some users don't have a good grasp of how they are supposed to transition from a skill back to Alexa's top level within a given session;
  • As a result, I would often have users repeatedly slamming the Fallback intent (and presumably having very poor experiences) by saying things that Alexa can handle at the top level but that my skill cannot;
  • And because the Fallback intent does not tell you in real time what the utterance was, I had no way of knowing which users were stuck in this befuddled state, and which were just poking around or looking for the various easter eggs that live in my skill's voice interface.
When I write the problem statement in that way, it seems obvious what the solution was, but it took a fair bit of shower musing and bouncing ideas off folks on Alexa Slack before I decided on the approach I eventually went with.


Enhanced Fallback

I had been thinking about how to solve this problem for a while. I had already implemented what I described as my "frustrated user" flow for CompliBot and InsultiBot (where I go into an extra hand-holdy mode if the user triggers Help or Fallback too often), and was surprised to find how often users were getting themselves into that state.
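For illustration, here's a rough sketch of how that kind of "frustrated user" tracking could be wired up with the ASK SDK's session attributes. The attribute name and threshold are assumptions for the example, not the actual CompliBot/InsultiBot implementation.

```typescript
import { HandlerInput } from 'ask-sdk-core';

// Illustrative threshold; the real skills' tuning isn't described in the post.
const FRUSTRATION_THRESHOLD = 2;

// Bump a per-session counter each time Help or Fallback fires, and report
// whether the user has crossed into "extra hand-holdy mode" territory.
function isUserFrustrated(handlerInput: HandlerInput): boolean {
  const attrs = handlerInput.attributesManager.getSessionAttributes();
  attrs.fallbackCount = (attrs.fallbackCount || 0) + 1;
  handlerInput.attributesManager.setSessionAttributes(attrs);
  return attrs.fallbackCount >= FRUSTRATION_THRESHOLD;
}
```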

The thing that triggered my final solution, though, was remembering back to our first certification pass for InsultiBot (in 2015!), where we had originally included an easter egg in our voice model around asking for the weather. One of our testers had asked it on a whim, and we thought it would be funny if InsultiBot knew how to respond to that question, despite it not being part of the core competency. We ended up pulling it out at the last minute to reduce VUI clutter, and because we had no way of measuring false positives or false negatives in the ASK at the time. The connection I made years later, though, was that the "easter egg" was functionally a highly-targeted Fallback. And once we had the ability to see what requests were coming in and getting mapped to specific intents, it became easy to throw easter-egg type stuff back into the voice model.

So, the next time I scoured my intent history for future model tweaks, I kept a special watch for repeated patterns in the Fallback intent. What I found was that there were a few requests that came up over and over again, and in fact one of them was asking for the weather, just like way back in InsultiBot v1! There were also four other common utterance groups: setting a timer, asking for a joke, asking to sing a song, and trying to launch another skill. I combined joke and song into a generic "performance" bucket, and I deferred the potentially trickier "launch" requests for later, leaving me with three fallback types that I'd build specific content for - weather, timers, and performance.
Using common but unexpected utterances on InsultiBot triggers the Fallback intent
You can see here what a generic FallbackIntent response might look like, for the "launch" bucket which I skipped in this PoC
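To give a sense of what those additions might look like, here's a sketch of the three intents as they might appear in an interaction model (shown as a plain object rather than the skill's actual JSON). The intent names and sample utterances are illustrative, not copied from InsultiBot's real model.

```typescript
// Illustrative interaction-model fragment for the three enhanced fallback
// buckets. Intent names and samples are assumptions, not the actual model.
const enhancedFallbackIntents = [
  {
    name: 'WeatherFallbackIntent',
    samples: ["what's the weather", "what's the weather like today", 'will it rain tomorrow'],
  },
  {
    name: 'TimerFallbackIntent',
    samples: ['set a timer', 'set a timer for ten minutes', 'start a countdown'],
  },
  {
    name: 'PerformanceFallbackIntent',
    samples: ['tell me a joke', 'sing me a song', 'sing happy birthday'],
  },
];
```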
I began referring to this flow as "enhanced fallback", and set about deciding what needed to be in any response to one of these triggers. Per the problems I was trying to solve above:
  1. I knew I had to have a bit that just outright explained to the user that they were still in my skill's context.
  2. I had to explain to them the way out, in case they were stuck (I'd never make it as a Vegas casino designer).
  3. Now that I knew exactly what they were asking for, I could still fulfill the skill's core competency of providing a fun, on topic quip.

Implementation Details

Once I knew the direction I wanted to go, things flew together. The voice model was easy - these were simple intents with no slots, where I'd just copy and paste examples from my intent history as the sample utterances. The hardest part was probably coming up with randomized output generation that sounded fluid no matter what combination I used, as I had distinct mix-and-match lists for the "You're still in {bot}" part, the "here's how you leave" part, and the easter egg part of the response, but even that was fairly trivial. But having built it, I actually sat on it for quite a while without releasing...
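As a rough illustration of that mix-and-match assembly (the phrase pools and helper are made up for the example, not the skill's actual content):

```typescript
// Illustrative phrase pools for the three parts of an enhanced fallback response.
const stillHere = [
  "Just so you know, you're still talking to InsultiBot.",
  'Heads up: this is InsultiBot, not Alexa proper.',
];
const howToLeave = [
  'If you want to ask Alexa something else, say "exit" first.',
  'Say "stop" to leave the skill, then ask Alexa again.',
];
const weatherQuips = [
  'The forecast calls for a one hundred percent chance of you still being insufferable.',
  "It's going to rain. The sky probably saw your face.",
];

// Pick one entry at random from a list.
const pick = <T>(list: T[]): T => list[Math.floor(Math.random() * list.length)];

// Compose the three parts into one response string.
const weatherFallbackSpeech = [pick(stillHere), pick(howToLeave), pick(weatherQuips)].join(' ');
```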
Unlike the generic response above, asking InsultiBot for the weather triggers the special enhanced fallback weather flow.
You see, that initial decision from way back in 2015, to pull the easter egg out in favor of keeping the model simple, still sat with me - and in fact we later had to pull more of our meta-intents because they were being hit too often. I wasn't comfortable introducing a feature that was going to berate a user for not knowing how to use the skill, and give them a long-winded explanation of what to do, if there was a chance they would trigger it when they hadn't actually screwed up.

Then, in the second half of last year, the solution was handed to me on a silver platter. I ended up in the NLU Bulk Evaluation Tool beta, which allows large-scale, potentially combinatorial testing of your voice model. This meant I could put my background in test to work and actually validate that adding these fun new intents wasn't going to accidentally draw users away from the intents they meant to invoke.

Still, though, I was paranoid that I was going to break things, so when I initially released it I rolled it out only on InsultiBot (the better-content-but-lower-traffic half of the CompliBot/InsultiBot dyad).

Turns out I had nothing to be worried about, and it's been hugely successful. I almost never have weather/timer/performance requests making it into the Fallback anymore, and I have yet to see a single false positive hit one of my three EnhancedFallback intents. A normal intent history filtering on those looks more or less like this:
[Screenshot: intent history filtered to the three EnhancedFallback intents]

Taking it further

So, how do we take this further?

For my part, I mentioned that there was a fourth bucket I had ignored: people making "play" requests. I skipped that one because I didn't want to deal with the disambiguation problem between "game" and "song" that Amazon has been wrestling with since the very beginning. But it seems like there's a subset of those requests I could probably pull into a fourth enhanced fallback bucket - the ones that use the "launch" phrasing instead. Beyond that, I think this feature has shown itself valuable enough that I should backport it to my other skills where it would fit.

I'm a little surprised that Amazon hasn't really made a move to try to solve this in a more common way across all skills. There are a variety of levels at which they could insert themselves, from just providing a built-in intent so we don't need to manage lists of weather utterances, all the way up to finding a way to make all features available via a default backstop even while in a skill context. I suspect that we'll see something in this area this year, but I still think it's well worth the time for skill devs to implement solutions in the meantime.

There's a second piece here, which is that part of the reason users don't understand the 1P-feature versus 3P-skill dichotomy is how Amazon has chosen to brand and market the ecosystem. They seem to be caught in a sort of perpetual cognitive dissonance between wanting to train users about the great things that skill builders are doing and wanting to provide customers an experience that "just works", and these two motivators are often at direct odds. Actively promoting features and marketing that make clear the distinction between functions provided by Amazon, Skill Builder A, and Skill Builder B could certainly help orient users, but at a significant cost to other aspects of usability.

Finally, one thing that I've seriously considered is actually trying to solve some of these requests on behalf of my users rather than just giving them a funny response. Amazon has very strict rules about skills acting as "marketplaces", but at the very least the idea of using skill connections to let Big Sky handle the weather request makes sense. If that feature provided a true hand-off capability, instead of the (largely panned) partial measures we currently have, I'd probably actually do it (and then naturally charge Steve Arkonovich for all that traffic I'm sending his way).

So what do you guys think? I'm sure I'm not the first one to come up with this idea - I'd love to see how others are approaching it.