Barge-in as a problem
Consider the two interactions below...
User: Alexa, launch foehammer
Alexa: Welcome back, what would you like to do?
User: Smash my foe
Alexa: You smash your foe right in their dumb face. Hey, did you know that you can also befriend your foe, just say "befriend"?
User: Smash my foe again...
User: Alexa, launch foehammer
Alexa: Welcome back, what would you like to do?
User: Smash my foe
Alexa: You smash your foe right -- <INTERRUPTED BY BARGE-IN>
User: Alexa, smash my foe again...
These two interactions will trigger the exact same intents, with the same slots, in the exact same order. From the skill's server side, they are functionally identical. Semantically, though, these are not the same. In the second scenario, the user never actually heard the prompt about befriending their foe, and therefore they do not know that it's even an option. The problem, though, is that it's important to the skill that the user receives this information, but there's no way to confirm that the user did indeed get the message.
This is one face of a broader class of problems summed up as: "The skill thinks the user knows something that the user does not actually know". A much more common case of this problem occurs when, for example, a user doesn't use a skill for a long period of time and then returns.
But today we're talking about barge-in, and what we can do to solve the problem of barge-in preventing users from hearing important information.
The easy solution
The voice assistant provider absolutely knows whether an intent was triggered by a barge-in or by the user waiting for the microphone to open normally. They just don't pass that information along today. Ideally, each intent in a live session (note that this problem does not apply to one-shot invocations) would carry a simple boolean attribute in the request payload saying whether the previous response's output was played to completion. You could theoretically take it further and ask for a timestamp or the part of the response where the barge-in happened, but the single flag really solves the 90% case.
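To make the ask concrete, here's a rough sketch of how a handler might branch on such a flag if it existed. This is plain Python with made-up dict shapes; the `previousResponseCompleted` field is entirely hypothetical and does not exist in any real request payload today.

```python
# Hypothetical: what acting on a barge-in flag could look like if the
# platform included one. "previousResponseCompleted" is an invented field
# name, NOT part of any real request payload.

def handle_intent(request: dict, session: dict) -> str:
    # The imagined flag: True if the user heard the previous prompt to the
    # end, False if they barged in over it.
    heard_previous_prompt = request.get("previousResponseCompleted", True)

    if not heard_previous_prompt and session.get("pending_announcement"):
        # The user cut us off before the important bit, so replay (or
        # rephrase) the announcement as part of this response.
        return session["pending_announcement"] + " Anyway, " + smash_response()

    return smash_response()

def smash_response() -> str:
    return "You smash your foe right in their dumb face."
```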
But, at least on the Alexa side, we don't have that today, so let's talk about a few things the dev community has been thinking about in its absence.
If you know the minimum length of time it takes to play the TTS content for a given intent, you could record the request timestamp and save it in user session. Then check on the next request, to see if diff was less than that (ie "barge in") & then you could mark it for replay.
— Mark Tucker (@marktucker) May 31, 2020
Playing the timing game
"If the next response in this session occurs at or earlier than current time + T, then the response occurred by means of a barge-in"
Realistically, you can probably stretch it even a little further with a high level of confidence by having a good sense of what the minimum overhead for a request-response cycle is. But what you definitely cannot do is make the opposite statement:
"If the next response in the session occurs later than current time + T, then the response did not occur by means of a barge-in"
The reason is that response time is an aggregate of many variables, and any number of combinations might get you to a larger-than-T response time. Let's enumerate:
- Different users have different internet speeds, which may result in different amounts of time for the audio to stream. This applies both to the rendered audio going down to the user and to the captured audio going back up to Alexa's ASR engine.
- As noted by Jeff Blankenburg, Alexa provides a mechanism to have output text read at different speeds by default, which fundamentally undermines the assertion that you can calculate T.
- Barge-in generally requires a wake word. A user with a wake word of "Echo" (2 syllables) is going to take less time to interrupt their smart speaker than one who has to say something like "Okay Google" (4 syllables).
- Further, the common pattern that I see users follow is something like "wake word, brief pause to ensure recognition by the device, utterance", but the length of that middle pause varies from user to user.
- The device being used also inserts its own latency - a 2nd Gen RPi running Alexa Pi is not going to be as snappy as a newest gen Echo.
- Speech-to-text time is also not a constant - high traffic can certainly introduce a delay in a user's response making it back to the skill.
All of these variables pale in comparison to the three biggest factors, however...
- The user's choice of words is going to be massively important in measuring the response time. Even for something simple like a yes or no response, there's a big difference between "no" and "no thank you, Alexa".
- Along those lines, the user's spoken cadence matters - a slow-spoken user is going to have a longer turnaround time than average.
- Finally, you may have noticed that sometimes - especially when there's a lot of background noise - your smart speaker has a hard time knowing when the utterance is complete. Those two seconds waiting for the microphone to close are a significant portion of the total response time.
So, sure, if your 4.3-second message garners a response 5.8 seconds later, it's possible that the user heard everything and replied with a terse "yep". But it's also possible they barged in at 3.6 seconds, spoke a couple of seconds' worth of response audio, and added 400 milliseconds of overhead from their slow internet and old device.
And there's another important factor here: the entirety of this technique is predicated on accurately knowing your TTS time for a given input string, which does not fit well at all with another best practice, generative text. If you are building your response strings to sound more human by variably prepending acknowledgement words ("Hmm...", "Ahh...") and by having multiple phrasings for each response string, potentially combinatorially built from smaller connecting phrases, then the odds are low that you'll ever actually have an accurate estimate of your TTS time.
In general, there may be some value here in terms of a strict "didn't hear" cutoff and then a secondary "technically might've heard, but probably didn't" confidence interval, but for the most part the number of variables involved is going to make the feature extremely inaccurate, and implementing such a system would be a lot of work to begin with.
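For what it's worth, a two-tier version of the timing heuristic might look something like the sketch below. It's framework-agnostic Python with invented attribute names and a guessed overhead constant; per all the caveats above, treat anything outside the hard cutoff as a weak signal at best.

```python
import time

# A rough guess, not a measurement: the smallest plausible overhead for a
# full request-response round trip.
ROUND_TRIP_OVERHEAD_SECONDS = 1.0

def record_prompt(session: dict, estimated_tts_seconds: float) -> None:
    """Call when sending a response whose tail the user needs to hear."""
    session["prompt_sent_at"] = time.time()
    session["prompt_tts_seconds"] = estimated_tts_seconds

def classify_barge_in(session: dict) -> str:
    """Classify the follow-up request: 'barged_in', 'maybe_heard', or 'unknown'."""
    sent_at = session.get("prompt_sent_at")
    tts = session.get("prompt_tts_seconds")
    if sent_at is None or tts is None:
        return "unknown"

    elapsed = time.time() - sent_at
    if elapsed < tts:
        # The reply arrived before the audio could even finish playing:
        # definitely a barge-in.
        return "barged_in"
    if elapsed < tts + ROUND_TRIP_OVERHEAD_SECONDS:
        # Technically possible they heard it all, but probably not.
        return "maybe_heard"
    # Anything slower tells us nothing, for all the reasons listed above.
    return "unknown"
```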
Perhaps have them say a phrase of the day to get an in-game bonus? The phrase is at the end of the announcements.
— Greg 'Papa Oom Mow Mow' Bulmash (@YiddishNinja) May 31, 2020
If they start playing w/o getting the bonus... "One golden egg is available if you say the phrase of the day. Would you like to learn it?"
Incentivizing listening
Greg's reward mechanic is clever, but unfortunately it probably isn't broadly applicable to the entire problem space. First, it benefits from being at the same point in the skill flow every time, which means users know they should be listening for it. This is great if you can pull it off and always have a dedicated "new stuff goes here" segment, but in a lot of cases you need to provide new information to the user inline, and triggers that happen in the normal course of interaction can't wait until the next time the user ends the session and starts a new one.
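For skills that can offer a reward like this, the bookkeeping itself is light. Here's a minimal sketch of the phrase-of-the-day mechanic, assuming you persist a flag per user; the attribute names and response strings are invented for illustration.

```python
# Minimal bookkeeping for a "phrase of the day" reward, persisted per user.
# The attribute name "bonus_claimed" and the reward itself are made up.

def build_announcement(todays_phrase: str) -> str:
    # The reward phrase goes at the very end, so only users who listened all
    # the way through the announcements will know it.
    return ("Here's what's new this week... "
            f"Oh, and today's phrase is '{todays_phrase}'. Say it any time for a golden egg.")

def handle_phrase_of_the_day(user: dict) -> str:
    # The user saying the phrase is the proof that the message landed.
    user["bonus_claimed"] = True
    return "A golden egg is yours!"

def handle_gameplay_start(user: dict) -> str:
    if not user.get("bonus_claimed"):
        # They started playing without the bonus, possibly because they barged
        # through the announcements; offer a second chance, per the tweet above.
        return ("One golden egg is available if you say the phrase of the day. "
                "Would you like to learn it?")
    return "What would you like to do?"
```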
Further, a lot of skills won't have any levers to pull in terms of incentivizing their users. For a game skill, sure, you can usually find a way to give away a little bit of content or something. But what about a utility skill, where the user is just trying to complete a discrete set of actions? What would the "Golden Egg" be in a skill about recording your pet's circadian rhythm to help them get their best sleep? (That's a free idea for any of you to implement. You're welcome.)
Finally, there's a potentially-good/potentially-bad follow-on effect of this approach. When you train a user to interact with your voice user interface in a certain way, you have to assume you're training them to use it that way across the entire experience. So if you're teaching them that barge-in is bad during launch, chances are they're also going to apply the "barge-in is bad" mantra across the rest of the experience. Is that really what you want? I don't think the concern is by any means universal, however - Gal Shenar made strong arguments about this exact point in the Alexa Slack discussion of this topic. And in fact, there are skills where barge-in is fundamental to their UX - St. Noire is a perfect example.
To add in: 1) For us, this goes back to the overall design. When and how are you introducing new features? We use sound effects when there’s something new, so people know the listen up, and then we try to keep the same flow and rhythm of the experience.
— Sarah Andrew Wilson (@SarahAndrewWils) May 31, 2020
Won't somebody please think of the earcons?!
Alas, I don't see a lot of voice applications taking this sort of advanced audio signaling approach yet. If we do get to a point where it's commonplace, that probably brings with it concerns about audio information overload. Or, maybe this is something that leads to standardization? Could there be a scenario where multiple skill makers use the same set of sounds to mean the same things?
A good question to which I have no good answer except to try to write prompts in a way that reduces the likelihood of barge in eg don't introduce a new feature at a point where the user thinks they already know what they're doing. Doesn't help if people decide to exit though...
— Tom Hewitson (@tomhewitson) May 31, 2020
Reducing the desire to barge-in
There are a couple of ways you can optimize for this. The first, as Tom notes in his tweet, is to avoid dropping important information at a moment when the user is performing a task that lends itself to barging in to begin with. That might be more easily said than done, but as a general rule of thumb it's something you can tack on to pretty much any other approach you use. It's worth noting, too, that there's a chicken-and-egg problem here: how do we reliably identify which of our contexts garner the most barge-ins when that attribute isn't passed along to us? The flip side is that this points to a secondary benefit of providing the barge-in flag: in aggregate, it would let us figure out which points in our user flows are annoying our users the most.
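Until the platform gives us that flag, the noisy timing classification from the section above is about the best proxy available for that kind of aggregate analysis. Here's a rough sketch, assuming each prompt in your skill has some stable identifier (the `prompt_id` scheme below is my own invention):

```python
from collections import defaultdict

# A rough per-prompt tally of how often follow-ups look like barge-ins,
# fed by the (noisy) timing classification sketched earlier.
barge_in_counts = defaultdict(lambda: {"barged_in": 0, "total": 0})

def record_follow_up(prompt_id: str, classification: str) -> None:
    barge_in_counts[prompt_id]["total"] += 1
    if classification == "barged_in":
        barge_in_counts[prompt_id]["barged_in"] += 1

def most_barged_prompts(min_samples: int = 50) -> list:
    # Rank prompts by barge-in rate; even a noisy signal, aggregated over
    # enough sessions, can point at the spots that most annoy users.
    rates = [
        (pid, counts["barged_in"] / counts["total"])
        for pid, counts in barge_in_counts.items()
        if counts["total"] >= min_samples
    ]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)
```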
The other way to optimize for Tom's approach is to just not give your users any reason to barge in. Make sure that, as often as possible, your users are not hearing extraneous output if they don't want it (this last part is important - some users DO want it). This topic by itself is worth a whole series of VUXcellence articles, but I'll take a moment to toot my own horn here and mention that the optionally enabled "short mode" I implemented in CompliBot and InsultiBot is one of the most meaningful features I've ever built, and I'd highly recommend that any skill that strives to be highly engaging go through the effort of letting its users configure how verbose an output they hear.
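As a sketch of the general idea (and not of how CompliBot or InsultiBot actually implements it), a verbosity setting can be as simple as one persisted attribute that selects between phrasings:

```python
# A sketch of a user-configurable verbosity setting. The attribute name
# and the exact phrasings are invented for illustration.

RESPONSES = {
    "smash": {
        "long": ("You smash your foe right in their dumb face. "
                 "What would you like to do next?"),
        "short": "Smashed. Next?",
    },
}

def set_verbosity(user: dict, mode: str) -> None:
    # Called from a "switch to short mode" / "switch to long mode" intent.
    user["verbosity"] = mode if mode in ("short", "long") else "long"

def speak(user: dict, response_key: str) -> str:
    mode = user.get("verbosity", "long")
    return RESPONSES[response_key][mode]
```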
Use it to prove it
The approach I keep coming back to is to simply keep delivering the new information until the user acts on it - actually using the feature is the only real proof we get that the message landed. There's an obvious downside to this, which is that a user may actually hear the message, understand it, and choose not to interact with it for the time being. In that case, the user will continue to be nagged about it on future iterations of whatever it is they're doing (sort of violating the exact thing I just praised myself for in the section above). But users will pretty universally understand that they also have an easy way out of the nag - just do the damn thing. You can think of the recurring nag as the voice equivalent of an "unread" notification on a mobile app.
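In code, this amounts to a single persisted flag that only flips once the user actually uses the feature. A minimal sketch, with invented attribute and handler names:

```python
# "Use it to prove it": keep nagging about the new feature until the user
# actually uses it, which is the only proof we get that the message landed.
# The attribute name "knows_about_befriend" is made up for illustration.

BEFRIEND_NAG = " Hey, did you know that you can also befriend your foe? Just say 'befriend'."

def handle_smash(user: dict) -> str:
    speech = "You smash your foe right in their dumb face."
    if not user.get("knows_about_befriend"):
        # Tack the nag on until the user has demonstrably used the feature.
        speech += BEFRIEND_NAG
    return speech

def handle_befriend(user: dict) -> str:
    # Using the feature is the proof; the nag goes away for good.
    user["knows_about_befriend"] = True
    return "You and your foe are now the best of friends."
```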
My current approach
User: Alexa, launch foehammer
Alexa: Welcome back, what would you like to do?
User: Smash my foe
Alexa: You smash your foe right in their dumb face. Hey, turns out you also have the option to befriend your foe. Wanna hear how?
The tweak here is that the new information now ends in a direct question, so the user's answer doubles as confirmation that they actually heard it. There's still the aforementioned downside of the user who does listen, but just wants to keep on swinging their foehammer, and therefore keeps hearing the nag over and over. There's also one sneakier problem that I don't have a solution for. Consider that instead of saying yes or no, the user might say "help". This goes into my help intent handler, and since the prompt was about befriending, I route it to "help about befriending", which as it turns out is exactly the same thing as just answering "yes" to begin with. But what if the user barged in, didn't hear the prompt, and then asked for help expecting tips about smashing faces? They'd be presented with an explanation of befriending, completely out of context, and I suspect that would be quite disorienting.
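Here's roughly how that routing looks as a sketch, including the help-intent wrinkle just described. The `pending_question` session attribute and the handler names are my own shorthand, not anything from the SDK.

```python
# A sketch of the question-based approach from the dialogue above. The
# "pending_question" session attribute and handler names are invented.

def handle_smash(session: dict, user: dict) -> str:
    speech = "You smash your foe right in their dumb face."
    if not user.get("knows_about_befriend"):
        session["pending_question"] = "offer_befriend"
        speech += (" Hey, turns out you also have the option to befriend "
                   "your foe. Wanna hear how?")
    return speech

def handle_yes(session: dict, user: dict) -> str:
    if session.pop("pending_question", None) == "offer_befriend":
        # Any answer to the question confirms the user heard the offer.
        user["knows_about_befriend"] = True
        return "To befriend your foe, just say 'befriend'. Go on, give it a try."
    return "Sorry, I'm not sure what you're saying yes to."

def handle_help(session: dict, user: dict) -> str:
    if session.get("pending_question") == "offer_befriend":
        # The wrinkle: if the user barged in and never heard the offer,
        # routing help here hands them a befriending explanation completely
        # out of context.
        return handle_yes(session, user)
    return "You can say 'smash my foe' to smash your foe."
```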
In general, though, I think the edge cases here are greatly outweighed by the value of this UX, and I'm pretty happy to move forward with this as my plan.
I'm curious what the rest of the community thinks about the ideas laid out here. Are there other approaches we haven't stumbled upon?
I also dropped a feature request for this on Alexa Uservoice, if you care to give it an upvote: https://alexa.uservoice.com/forums/906892-alexa-skills-developer-voice-and-vote/suggestions/40554694-provide-barge-in-flag-on-request