3PO-LABS: ALEXA, ECHO AND VOICE INTERFACE

VUXcellence: Working with Barge-Ins

5/31/2020

1 Comment

 
In voice user interfaces, we often operate under the assumption that the dialog will happen in turns. This doesn't exactly track real world language, though, and so VUI has a notion called "barge-in" to describe the case where the user interrupts the interface's output. This can be a potentially powerful feature, but it also has consequences that can be difficult to work with. In this article, we explore one side effect further.


Barge-in as a problem

This post comes about as a result of a wonderful Twitter dialog with several of the top Alexa developers. The question posed was as follows:

Consider the two interactions below...
User: Alexa, launch foehammer
Alexa: Welcome back, what would you like to do?
User: Smash my foe
Alexa: You smash your foe right in their dumb face. Hey, did you know that you can also befriend your foe? Just say "befriend".
User: Smash my foe again...


User: Alexa, launch foehammer
Alexa: Welcome back, what would you like to do?
User: Smash my foe
Alexa: You smash your foe right -- <INTERRUPTED BY BARGE-IN>
User: Alexa, smash my foe again...

These two interactions will trigger the exact same intents, with the same slots, in the exact same order. From the skill's server side, they are functionally identical. Semantically, though, these are not the same. In the second scenario, the user never actually heard the prompt about befriending their foe, and therefore they do not know that it's even an option. The problem, though, is that it's important to the skill that the user receives this information, but there's no way to confirm that the user did indeed get the message.

This is one face of a broader class of problems summed up as: "The skill thinks the user knows something that the user does not actually know". A much more common case of this problem occurs when, for example, a user doesn't use a skill for a long period of time and then returns.

​But today we're talking about barge-in, and what we can do to solve the problem of barge-in preventing users from hearing important information.

The easy solution

As this is a long article, I'm just gonna go ahead and give away the bottom line here near the top: There's an easy solution. That easy solution is one that isn't currently available to us.

The voice assistant provider absolutely knows whether an intent came in via a barge-in or via the user waiting until the microphone opened normally. They just don't pass that information along today. Ideally, each intent in a live session (note that this problem does not apply to one-shot invocations) would have a simple boolean attribute in the request payload saying whether the previous response's output played to completion. You could theoretically take it further and ask for a timestamp or the part of the response where the barge-in happened, but the single flag really solves the 90% case.
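To make that concrete, here's a minimal Python sketch of what a handler might do if such a flag existed. The "responseInterrupted" field, the session attribute names, and the handler shape are all invented for illustration; nothing like this exists in today's request payload.

def handle_smash_intent(request: dict, session: dict) -> str:
    # Hypothetical: 'responseInterrupted' would tell us whether the previous
    # response was cut off by a barge-in. Alexa does not send this today.
    interrupted = request.get("responseInterrupted", False)

    # If the last response carried the befriend tip and played to completion,
    # we can finally record that the user has actually been told about it.
    if session.get("tip_in_last_response") and not interrupted:
        session["tip_delivered"] = True

    speech = "You smash your foe right in their dumb face."
    if not session.get("tip_delivered"):
        # Keep attaching the tip until we know a full playback happened.
        speech += " Hey, did you know that you can also befriend your foe? Just say befriend."
        session["tip_in_last_response"] = True
    else:
        session["tip_in_last_response"] = False
    return speech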

But, at least on the Alexa side, we don't have that today, so let's talk about a few things the dev community has been thinking about in its absence.

If you know the minimum length of time it takes to play the TTS content for a given intent, you could record the request timestamp and save it in user session. Then check on the next request, to see if diff was less than that (ie "barge in") & then you could mark it for replay.

— Mark Tucker (@marktucker) May 31, 2020

Playing the timing game

A few different folks called this out, but if you're really interested in trying to understand whether the user heard your full response, you're not completely out of options. If you know how long the audio rendered by the text-to-speech engine for your output is going to be (which we'll denote as T), then you can definitively make the following statement:

"If the next response in this session occurs at or earlier than current time + T, then the response occurred by means of a barge-in"

Realistically, you can probably stretch it even a little further with a high level of confidence by having a good sense of what the minimum overhead for a request-response cycle is. But what you definitely cannot do is make the opposite statement:

"If the next response in the session occurs later than current time + T, then the response did not occur by means of a barge-in"

The reason here is that response time is an aggregate of many variables, and there are any number of combinations that might get you to a larger-than-T response time. Let's enumerate:
  1. Different users have different internet speeds, which may result in different amounts of time for the audio to stream. This applies both to the rendered audio going down to the user and to the captured audio going back up to Alexa's ASR engine.
  2. As noted by Jeff Blankenburg, Alexa provides a mechanism to have output text read at different speeds by default, which fundamentally undermines the assertion that you can calculate T.
  3. Barge-in generally requires a wake word. A user with a wake word of "Echo"  (2 syllables) is going to take less time to interrupt their smart speaker than one who has to say something like "Okay Google" (4 syllables).
  4. Further, the common pattern that I see users follow is something like "Wake word, brief pause to ensure recognition by the device, utterance", but that middle pause is variable for every different user.
  5. The device being used also inserts its own latency - a 2nd Gen RPi running Alexa Pi is not going to be as snappy as a newest gen Echo.
  6. Speech-to-text time is also not a constant - high traffic can certainly introduce a delay in a user's response making it back to the skill.

All of these variables pale in comparison to the three biggest factors, however...
  1. The user's choice of words is going to be massively important in measuring the response time. Even for something simple like a yes or no response, there's a big difference between "no" and "no thank you, Alexa".
  2. Along those lines, the user's spoken cadence matters - a slow-spoken user is going to have a longer turnaround time than average.
  3. Finally, you may have noticed that sometimes - especially when there's a lot of background noise - your smart speaker has a hard time knowing when the utterance is complete. Those two seconds waiting for the microphone to close are a significant portion of the total response time.

So, sure, if your 4.3 second message garners a response 5.8 seconds later, it's possible that the user heard everything and replied with a terse "yep". But it's also possible they barged in at 3.6 seconds, spoke a couple seconds of response audio, and added 400 milliseconds of overhead from their slow internet and old device.

And there's another important factor here, which is that the entirety of this technique is predicated on accurately knowing your TTS time for a given input string, which does not fit well at all with another best practice - generative text. If you are building your response strings to sound more human by variably prepending acknowledgement words ("Hmm...", "Ahh,") and by having multiple phrasings for each response string, potentially combinatorially built from smaller connecting phrases, then the odds are low that you'll ever actually have an accurate estimate of your TTS time.
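As a rough illustration (the phrase lists below are made up), even a trivially "humanized" response builder produces outputs whose spoken length varies by a few seconds from call to call, which is exactly what sinks a fixed-T assumption.

import random

ACKNOWLEDGEMENTS = ["", "Hmm... ", "Ahh, ", "Alright, here goes. "]
SMASH_PHRASES = [
    "You smash your foe right in their dumb face.",
    "Your foehammer connects, and your foe goes down hard.",
]
TIPS = [
    "Hey, did you know that you can also befriend your foe? Just say befriend.",
    "By the way, befriending your foe is also an option.",
]

def build_smash_response() -> str:
    # Every component is chosen at random, so the TTS duration of the
    # final string is different on every call.
    return random.choice(ACKNOWLEDGEMENTS) + random.choice(SMASH_PHRASES) + " " + random.choice(TIPS)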

In general, there may be some value here in terms of a strict "didn't hear" cutoff and then a secondary "technically might've heard, but probably didn't" confidence interval, but for the most part the number of variables involved is going to make the feature extremely inaccurate, and implementing such a system would be a lot of work to begin with.
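For completeness, here's what that two-tier verdict might look like as a minimal Python sketch. It assumes you can stash the previous request's timestamp and an estimated TTS duration in session attributes, and the overhead margin is a guess you'd have to tune per skill; none of this comes from the platform itself.

import time

OVERHEAD_MARGIN_SECONDS = 2.0  # guessed allowance for network, ASR, and device latency

def classify_turnaround(session: dict) -> str:
    # 'last_request_at' and 'estimated_tts_seconds' are attributes your own
    # code would have saved when building the previous response.
    now = time.time()
    last_request_at = session.get("last_request_at")
    estimated_tts = session.get("estimated_tts_seconds", 0.0)

    verdict = "unknown"
    if last_request_at is not None:
        elapsed = now - last_request_at
        if elapsed <= estimated_tts:
            verdict = "barged_in"            # definitely didn't hear it all
        elif elapsed <= estimated_tts + OVERHEAD_MARGIN_SECONDS:
            verdict = "probably_barged_in"   # technically might've heard, probably didn't
        else:
            verdict = "probably_heard"       # still not a guarantee

    session["last_request_at"] = now
    return verdict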

Perhaps have them say a phrase of the day to get an in-game bonus? The phrase is at the end of the announcements.

If they start playing w/o getting the bonus... "One golden egg is available if you say the phrase of the day. Would you like to learn it?"

— Greg 'Papa Oom Mow Mow' Bulmash (@YiddishNinja) May 31, 2020

Incentivizing listening

Alright, so if we can't easily identify barge-ins, maybe the better alternative is to just incentivize people to listen 'til the end. Greg Bulmash proposed this fairly novel approach that you could use for some specific cases, wherein you give people a reward that they can only collect by listening to the full message. The specific context here was around a use case where notifications are appended at the end of the Launch response for a game skill, and I think that his idea works marvelously under those constraints.
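A back-of-the-napkin sketch of that idea might look something like the following, where the phrase, the handler, and the profile shape are all invented for illustration. The phrase gets read at the very end of the launch announcements, so only users who listened can claim the bonus.

PHRASE_OF_THE_DAY = "golden egg"  # in practice, rotate this daily

def handle_phrase_intent(spoken_phrase: str, profile: dict) -> str:
    if profile.get("bonus_claimed_today"):
        return "You've already claimed today's bonus."
    if spoken_phrase.strip().lower() == PHRASE_OF_THE_DAY:
        profile["bonus_claimed_today"] = True
        return "That's it! One golden egg has been added to your hoard."
    return "That's not today's phrase. It's hiding at the end of the announcements."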

Unfortunately, it probably isn't broadly applicable to the entire problem space. First, it benefits from being at the same point in the skill flow every time, which means the users know they should be listening for it. This is great if you can pull it off and always have a dedicated "new stuff goes here" segment, but in a lot of cases you need to provide new information to the user inline, and triggers that happen in the normal course of interaction can't wait until the next time the user ends the session and starts a new one.

Further, a lot of skills won't have any levers to pull in terms of incentivizing their users. For a game skill, sure, you can usually find a way to give away a little bit of content or something. But what about for a utility skill, where the user is just trying to complete a discrete set of actions? What would the "Golden Egg" be in a skill about recording your pet's circadian rhythm to help them get their best sleep? (That's a free idea for any of you to implement. You're welcome.)

Finally, there's a potentially-good/potentially-bad follow-on effect of this approach. When you train a user to interact with your voice user interface in a certain way, you kind of have to assume that you're training them to use it that way across the entire experience. So if you're teaching them that barge-in is bad during launch, chances are that they're also going to apply the "barge-in is bad" mantra across the rest of the experience. Is that really what you want? The answer isn't by any means universal - Gal Shenar made strong arguments about this exact point on the Alexa Slack instance of this discussion. And in fact, there are skills where barge-in is fundamental to their UX - St. Noire is a perfect example.

To add in: 1) For us, this goes back to the overall design. When and how are you introducing new features? We use sound effects when there’s something new, so people know the listen up, and then we try to keep the same flow and rhythm of the experience.

— Sarah Andrew Wilson (@SarahAndrewWils) May 31, 2020

Won't somebody please think of the earcons?!

Sarah Andrew Wilson's take on the problem was to go down the signaling route. Rather than training your users that "on launch, always wait til the end", you can instead teach them that "when you hear (custom sound/tone), that means we've got an update for you". This is very similar to the very first VUXcellence topic we wrote about, and if you have the sound-design resources to be able to do it, I think it's a fabulous approach. Training a sort of Pavlovian response into a user in this way is not trivial to implement, and it requires you have fairly engaged users to begin with (otherwise their attention will lapse before the lesson is ingrained). On the flip side, you don't have to resort to hacks to figure out when your message has been heard, nor are you requiring your users to listen to content they don't strictly need to hear.
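In SSML terms, the mechanics of the cue itself are simple. The sketch below (placeholder audio URL, invented flag name) just prepends the same short sound whenever a response carries new information, so the tone itself becomes the "listen up" signal.

# The earcon must be a hosted HTTPS audio file that meets Alexa's SSML audio requirements.
NEW_INFO_EARCON = '<audio src="https://example.com/sounds/new-info.mp3"/>'

def wrap_with_earcon(speech: str, has_new_info: bool) -> str:
    # Prepend the cue only when there's genuinely something new to announce,
    # otherwise the sound loses its meaning.
    body = (NEW_INFO_EARCON + " " + speech) if has_new_info else speech
    return "<speak>" + body + "</speak>"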

Alas, I don't see a lot of voice applications taking this sort of advanced audio signaling approach yet. If we do get to a point where it's commonplace, that probably brings with it concerns about audio information overload. Or, maybe this is something that leads to standardization? Could there be a scenario where multiple skill makers use the same set of sounds to mean the same things?

A good question to which I have no good answer except to try to write prompts in a way that reduces the likelihood of barge in eg don't introduce a new feature at a point where the user thinks they already know what they're doing. Doesn't help if people decide to exit though...

— Tom Hewitson (@tomhewitson) May 31, 2020

Reducing the desire to barge-in

If Greg Bulmash's approach was "give the user a reason to stick around 'til the end", Tom Hewitson took pretty much the exact opposite approach toward that same end (making sure that the user hears the message). His take is to minimize the chances your user is going to have a reason to barge in and miss the memo.

There are a couple of ways you can optimize for this. The first, as Tom notes in his Tweet, is to just avoid dropping important information at a time when the user is performing a task that lends itself to barging in to begin with. That might be more easily said than done, but I think as a general rule of thumb it's something you can tack on to pretty much any other approach you'll use. It's worth noting here, too, that there's sort of a chicken-and-egg problem: How do we reliably identify which of our contexts garner the most barge-ins when that attribute isn't passed along to us? And the flip side of this is that it identifies a secondary benefit to theoretically providing the barge-in flag - it would allow us, in aggregate, to figure out which points in our user flows are annoying our users the most!

On that note, the other way that you optimize for Tom's approach is to just not give your users any reason to barge-in. Make sure that, as often as possible, your users are not hearing extraneous output if they don't want it (this last part is important - some users DO want it). This topic by itself is worth a whole series of VUXcellence articles, but I'll take a moment to toot my own horn here and mention that the optionally enabled "short mode" I implemented in CompliBot and InsultiBot is one of the most meaningful features I've ever built, and I'd highly recommend that any skill that strives to be highly engaging go through the effort of letting their users configure how verbose of an output they hear.
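As a rough sketch of what that can look like (the names here are invented, and in practice the preference would live in persistent attributes so it survives across sessions), the skill keeps two renderings of each response and lets the user's stored verbosity setting pick between them.

RESPONSES = {
    "smash": {
        "long": "You wind up, swing your mighty foehammer, and smash your foe right in their dumb face.",
        "short": "Foe smashed.",
    },
}

def speak(response_key: str, profile: dict) -> str:
    # Users would opt in to short mode via something like a "be brief" intent.
    verbosity = "short" if profile.get("short_mode") else "long"
    return RESPONSES[response_key][verbosity]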

Use it to prove it

Finally, I wanted to give a shoutout to Gal a second time for his thoughts on entirely reframing the question in a conversation on Alexa Slack. As a reminder of the original problem, I wanted to make sure my users heard a message about doing a new action, so that I could set a flag saying "this user has been notified about X, don't notify them again". Gal's point was to sort of flip that on its head and assert that the flag shouldn't be set when you tell them; it should be set when they utilize the feature for the first time. Essentially, regardless of what you've told them, you should assume that the user does not know how to take an action until such time as they take that action.

There's an obvious downside to this, which is that a user may actually hear the message, understand it, and choose not to interact with it for the time being. In that case, the user will continue to be nagged about it on future iterations of whatever it is they're doing (sort of violating the exact thing I just praised myself for in the section above). But users will pretty universally understand that they also have an easy way out of the nag - just do the damn thing. You can think of the recurring nag as maybe the voice equivalent of an "unread" notification on a mobile app.
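A minimal sketch of that flow, with invented attribute names: the "knows about befriending" flag is only set inside the befriend handler, so the nag rides along with every smash until the user has actually tried the feature.

def handle_smash(profile: dict) -> str:
    speech = "You smash your foe right in their dumb face."
    if not profile.get("has_befriended"):
        # Keep nagging until they've proven they know the feature by using it.
        speech += " Hey, did you know that you can also befriend your foe? Just say befriend."
    return speech

def handle_befriend(profile: dict) -> str:
    profile["has_befriended"] = True  # proof of understanding: they used it
    return "You extend a hand in friendship. Your foe, confused, accepts."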

My current approach

This conversation was hugely helpful in clarifying my thoughts on the problem, so I wanted to share what I'm thinking in terms of design. Given our same hypothetical skill above, I'd instead change the prompt to be something like:
User: Alexa, launch foehammer
Alexa: Welcome back, what would you like to do?
User: Smash my foe
Alexa: You smash your foe right in their dumb face. Hey, turns out you also have the option to befriend your foe. Wanna hear how?

Current flow

This combines a few of the above approaches. It relies most heavily on the proof of understanding approach, but in this case I offer them a boolean and if they interact with that boolean in either direction I count that as proof that they have considered my prompt. It doesn't force them to do the thing they don't want to do (they can say no), so hopefully the nag won't be too overbearing. And for those users who did barge-in and therefore didn't hear the prompt, they'll get it next time.
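Sketched out (again with invented names), the flow looks roughly like this: the tip ends with a yes/no question, and answering it in either direction sets the "acknowledged" flag, so the nag stops regardless of which way they went.

def handle_smash(profile: dict, session: dict) -> str:
    speech = "You smash your foe right in their dumb face."
    if not profile.get("acknowledged_befriend_tip"):
        speech += " Hey, turns out you also have the option to befriend your foe. Wanna hear how?"
        session["pending_question"] = "befriend_tip"
    return speech

def handle_yes_no(answered_yes: bool, profile: dict, session: dict) -> str:
    if session.pop("pending_question", None) == "befriend_tip":
        profile["acknowledged_befriend_tip"] = True  # either answer counts as proof they heard it
        if answered_yes:
            return "To befriend your foe, just say befriend. So, what'll it be?"
        return "No problem. What would you like to do next?"
    return "Hmm, I'm not sure what you're saying yes or no to. What would you like to do?"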

There's still the aforementioned downside of the user who does listen, but just wants to keep on swinging his foehammer, and therefore keeps hearing the nag over and over. There's also one sneakier problem that I don't have a solution for. Consider that instead of saying yes or no, the user might instead say "help". This goes into my help intent handler, and since the prompt was about befriending, I route it to "help about befriending", which as it turns out is exactly the same thing as just answering "YES" to begin with. But what if the user barged in, didn't hear the prompt, and then asked for help thinking they would get tips about smashing faces? They'd then be presented with an explanation of befriending, completely out of context, and I suspect it would be quite a disorienting situation.

In general, though, I think the edge cases here are greatly outweighed by the value of this UX, and I'm pretty happy to move forward with this as my plan.

I'm curious what the rest of the community thinks about the ideas laid out here. Are there other approaches we haven't stumbled upon?

I also dropped a feature request for this on Alexa Uservoice, if you care to give it an upvote: 
https://alexa.uservoice.com/forums/906892-alexa-skills-developer-voice-and-vote/suggestions/40554694-provide-barge-in-flag-on-request
1 Comment
Jo Jaquint
6/9/2020 06:11:10 am

Several years ago when StarLanes started using pull advertising, I began to suspect people were using barge-in to skip the advertising. I considered using timing to detect this. It's an optimal condition since the ads are 30 seconds or so long, and all the hair-splitting over timing you talk about isn't a big deal. But, in the end, it wasn't worth it. It's one of the few ways that users can adjust the user interface to their own liking, and we felt it best to let them retain that quantum of control.

We've also done the style of selective prompting you've described for several years. But, again, barge-ins have never been much of a concern. First off, as we recommend in our talks, and you have echoed, if it's really important, you need to put the information first. Secondly, people rarely use barge-ins unless they are habitual players who already know the system well. So, again, if they are using a barge-in, it is usually their way of tailoring the user experience to something they want. And, given this, it doesn't really matter if it perturbs your degraded prompts calculations, since, for an experienced user, it is unlikely to be relevant.

And the whole concept of "train your users to listen to the whole message" is a bit daft. Every voice prompt system starts with "please listen to the whole message as the prompts may have changed". But no one *ever* pays attention. The message is not there for the user experience, but for their legal coverage. And a voice prompt system is very different from a casual voice assistant. You don't have the same leverage over your user. So it's never a user experience I would follow.

Even after all of that, if you conclude it is something you want to address, I'm not sure your final dialog is the best. I would stick to our "important things first" rule and phrase it like this:

User: Smash my foe
​Alexa: Again? You can ask about befriending instead if you like. But, anyway. You smash your foe right in their dumb face.

The user _wants_ to know if they smashed the foe or not. They _aren't_ going to barge-in on the message until they hear that. So you've got that little bit of leverage to tell them what *you* want them to hear.
But you can't over-use it. Remember: a barge-in is the user's way of altering the user interface to what they want it to be. If you mess with that too much, they are going to be less satisfied with their interaction. If you are concerned about barge-in, consider why the user is doing it, and if there is a better way of conveying your information so they don't have to.

