
3PO-Labs: Alexa, Echo and Voice Interface

VUXcellence: Handhold Mode

5/1/2019

3 Comments

 
Nobody can argue against the fact that the Alexa platform has grown by leaps and bounds over the last two years. Many of the problems we faced as voice designers are gone or mitigated, and we have a million tools at our disposal to address the issues that remain. Definitely a good thing, but it leads to a couple of new pitfalls. The first is that we now have that much more "rope to hang ourselves with", so to speak - there are a ton of failure modes that simply didn't exist when we were building CompliBot and InsultiBot in 2015. At the same time, all of these new features have raised the bar for what users expect out of a baseline Alexa experience, meaning the onus is on skill builders to solve increasingly complex problems. What I want to talk about today is one of those problems: how do you know if your user is having a bad experience, and what can you do about it?


The inspiration for this VUXcellence post actually came from a few places, which I'll call out to start. After that, I'll jump into some details of what I'm doing with CompliBot and InsultiBot along these same lines, as my approach felt like a good first step along a much longer path - one that I'd love reader input on.

One for the (intent) history books

I started having conversations about this topic in the summer of 2018 with Nick Schwab, the founder of Invoked Apps and builder of the wildly successful sound skills that have been dominating the skill store for the last couple of years. The topic was metrics, and trying to understand the quality of experience each individual user was having, rather than just looking at aggregate numbers like session length and doing cohort analysis or A/B testing. Nick's metrics pipeline has always been way better than anything backing the 3PO Labs skills (in fact, he presented on it at re:Invent 2017), and that data and those techniques are certainly valuable for a bunch of reasons - in a post-mortem they're helpful in establishing that some of our users were struggling. But one thing they don't do is tell us why things went poorly for those users.

Fortunately for us, we were coming off of a new Alexa feature that dropped last spring to entirely too little fanfare - the Intent History API. As any Alexa developer knows, you don't get the actual utterance in any request that comes to your skill - you get the much simplified (and therefore lower fidelity) intent and slot values. As developers, we always knew when users were saying things to our skills that worked, because our model was built from a constrained list of possible utterances. What we could never know before this feature was which things our users were saying were mapping to the wrong intent, or were causing them to fail out of the skill entirely. The release of Intent History changed that.
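For anyone who hasn't stared at the raw payloads: here's a minimal sketch (plain Python, no SDK) of everything a skill actually gets to see on a given turn - the resolved intent name and slot values, and nothing resembling the words the user actually spoke.

```python
# A minimal sketch of what a skill receives on each turn: the resolved intent
# name and slot values, never the raw utterance.
def summarize_turn(event: dict) -> dict:
    request = event["request"]
    if request["type"] != "IntentRequest":
        return {"type": request["type"]}
    intent = request["intent"]
    slots = {name: slot.get("value") for name, slot in intent.get("slots", {}).items()}
    # Note what's missing: nothing in here contains the words the user spoke.
    return {"intent": intent["name"], "slots": slots}
```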
[Image: Some samples of the weird things people say to InsultiBot today]
The Intent History feature provided us with aggregate data describing the exact phrases users were uttering, and whether or not those phrases were mapping to the right place in the skill. With this in hand, we could actually understand why some users were getting high failure rates, sometimes resulting in bad reviews. For example, in my case, one really common failure mode was a user trying to exit by saying something like "stop InsultiBot".

Unfortunately for them, Alexa was keying in on the use of the "InsultiBot" name, and mapping the utterance to "Who is {botName}", a seldom-used meta intent that lets users ask AstroBot/CompliBot/DiceBot/InsultiBot about any of the other bots. You can imagine a user - who is done with the skill and just wants out - getting frustrated when they are repeatedly told about the skill they're trying to leave.
I had known for over a year that users were hitting the "WHO_IS" intent far too often, but until this point there was no way to know the exact bad experience these users were having. And in retrospect it's silly - adding "stop InsultiBot" as a sample for the StopIntent is such a simple fix, and it could've been implemented at any time had I just known. I could've avoided reviews like this one:
[Image: a bad review left by a user stuck in this loop]
So, big fan of Intent History. It allowed those of us interested in refining our skill models to iterate in pursuit of a cleaner experience. The primary downside, though, was that it was still a passive approach - it allowed us to correct problems, but not while the problem was happening.
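To make that mining step concrete, here's roughly the kind of offline pass I mean - a sketch, not real pipeline code, and the field names (utteranceText, intent) are stand-ins for however you've pulled the Intent History data down rather than a literal schema reference.

```python
# Sketch of an offline pass over Intent History data that surfaces utterances
# which look like exit attempts but landed on the wrong intent.
EXIT_PHRASES = ("stop", "quit", "exit", "go away")

def find_missed_exits(history_items: list) -> list:
    suspicious = []
    for item in history_items:
        utterance = item.get("utteranceText", "").lower()
        resolved = item.get("intent", "")
        # Someone clearly trying to leave, but not landing on a stop/cancel intent.
        if any(p in utterance for p in EXIT_PHRASES) and resolved not in (
            "AMAZON.StopIntent",
            "AMAZON.CancelIntent",
        ):
            suspicious.append({"said": utterance, "mapped_to": resolved})
    return suspicious
```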


re:Inventing the handhold

The biggest inspiration for the feature I ended up building, and ergo for this post, came out of an offhand comment during a re:Invent 2018 talk. Gal Shenar, of Stoked Skills, was discussing his escape room skill and mentioned some of the struggles he'd had with users getting confused. His assertion was that everyone builds help text into their skills, so why not actively watch for triggers that might imply the user is confused and needs that help again, even if they didn't ask for it? And indeed, he was seeing that offering that help - help the users didn't even realize they needed - was having great returns.

That was sort of an epiphany moment for me, as I realized that I already had (or thought I had...) the mechanism for catching these triggers - I was just using it in reverse for a completely different purpose. I think I went back to my hotel room that night and started fiddling with what would become my eventual solution.

What's the inverse of a graduated back-off?

So, having seen Gal's talk, the thing that stuck out to me was this: we don't always want to give people help they haven't asked for, so how can we know that our users really need it without them explicitly telling us? And it occurred to me that this was the exact opposite of one of our earliest features, the graduated backoff of our handhold mode.

When CompliBot and InsultiBot first went to cert, we had a month-long argument about whether or not we needed to prompt users after every turn of our dialog. The Alexa design rules were very different at the time, and certification was extremely paranoid about the idea that users wouldn't know what to do if we didn't expressly tell them after every single response. Our user testing didn't show this to be true for skills as straightforward as CompliBot - it was super apparent that you could either make your request again, ask for another, or exit - but cert wasn't budging, so we had to give them something. Begrudgingly, we implemented a graduated backoff of what we described as our handhold. After the user had successfully completed a few turns in a given session, we shortened our prompt. A few more turns, and we dropped the prompt altogether, providing only a reprompt if they didn't say anything. And finally, we threw away the reprompt too, the idea being that once a user had heard the prompts several times, they already knew what their options were and didn't need to hear them again.
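For the curious, the backoff itself is only a few lines of logic. This is an illustrative sketch - the turn thresholds here are made up for the example, not the exact values CompliBot and InsultiBot ship with.

```python
# Illustrative sketch of the session-start handhold backoff: the more good
# turns a user completes, the less prompting they get.
def handhold_level(successful_turns: int) -> dict:
    if successful_turns < 2:
        return {"prompt": "full", "reprompt": True}   # full prompt on every response
    if successful_turns < 4:
        return {"prompt": "short", "reprompt": True}  # shortened prompt
    if successful_turns < 6:
        return {"prompt": None, "reprompt": True}     # reprompt only if they go silent
    return {"prompt": None, "reprompt": False}        # they know what they're doing
```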

This seems quaint today, when almost every polished skill is doing something of the sort, but at the time it was really only our skills and those built by Jo Jaquinta and TsaTsaTzu that were doing anything more advanced with prompting. The connection I made after Gal's talk was as follows:

"If we can track a user's session and count the number of good requests they're making to assert that they are well equipped to use the skill without us holding their hand, why can't we also track how many bad requests they're making in order to assert that they actually are in need of immediate help?"

And indeed, at a basic level the concept made total sense. If I could keep track of negative events happening in a session, I could eventually reach a point where I implicitly knew it was time to intervene.

In which I geek out at length on fallback algorithms...

In practice, the implementation ended up being a bit trickier. Transitioning a user off of the session-start handhold mode was really just a matter of looking at session length, because for the typical user the length of the session was a good proxy for whether or not they knew how to use the skill. For the inverse, however, we needed to know how often negative events were happening for the user.

So what I did was define an arbitrary session variable named "score". This score starts at 0 and counts up every time a user hits a "good" intent (which is to say anything other than "Help", "Fallback", or a couple of meta intents). There are a few things they can do that keep the score neutral - asking for a repeat doesn't change it, unless it's a repeat of "Help". And of course, asking for Help or hitting the Fallback intent decreases their score.
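As a sketch, the classification side of that looks something like the following. The intent lists are illustrative stand-ins, not the real ones from the bots.

```python
# Sketch of the per-turn classification that feeds the score.
BAD_INTENTS = {"AMAZON.HelpIntent", "AMAZON.FallbackIntent"}
META_INTENTS = {"WhoIsIntent"}  # placeholder for the handful of meta intents

def classify_turn(intent_name: str, last_intent: str | None = None) -> str:
    if intent_name == "AMAZON.RepeatIntent":
        # Repeats are neutral, unless the user is re-hearing the help text.
        return "bad" if last_intent == "AMAZON.HelpIntent" else "neutral"
    if intent_name in BAD_INTENTS:
        return "bad"
    if intent_name in META_INTENTS:
        return "neutral"
    return "good"
```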

Originally, this was an entirely linear algorithm: one point up for good, one point down for bad. The problem, then, was deciding where to start inserting the handhold again. It didn't make sense to trigger it at any positive score, because with everyone starting at zero, every new user would immediately fall into that bucket. At the same time, a user who triggered a few good intents and got their score up to 5 would then have to trigger 6 bad intents in a row before being given any help. Obviously most users are lost well before that point.

So I iterated on the algorithm and decided that instead we would weight consecutive "bad" triggers geometrically. A user who had one good intent, one bad intent, and one good intent, in sequence, would still end up with a net score of +1. But for any user triggering multiple bad intents in a row, the penalty grows according to the following pattern:
penalty for the nth consecutive bad intent = 2^(n-1), so n failures in a row cost 2^n - 1 points in total
Which is to say, a user having two bad events in a row accumulates a -1 and a -2, for a net of -3. Three failures in a row gets -1, -2, -4, for a net of -7. You can see, then, how that same user who had 5 good intents before running into trouble is now ending up in negative territory in half the number of failed intents.
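In code, the scoring update I landed on looks roughly like this - a sketch against a plain session-attributes dict, with attribute names I made up for the example:

```python
# Sketch of the geometric scoring update. One good turn resets the streak;
# each consecutive bad turn doubles the penalty: -1, -2, -4, ...
def update_score(session: dict, outcome: str) -> int:
    score = session.get("score", 0)
    bad_streak = session.get("bad_streak", 0)

    if outcome == "good":
        score += 1
        bad_streak = 0
    elif outcome == "bad":
        bad_streak += 1
        score -= 2 ** (bad_streak - 1)
    # "neutral" outcomes leave everything alone.

    session["score"] = score
    session["bad_streak"] = bad_streak
    return score
```

Three bad turns from a fresh session land the user at -7: already past the point where the prompts come back, and most of the way to the frustrated-user threshold described next.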

Once the scoring mechanism was in, dealing with the issue was quite simple. Any user whose score dips into the negatives immediately gets prompts and reprompts turned back on. Taking it further, any user who manages a score of -10 or lower triggers my "frustrated user" flow, where I actively give them some extra help text, written of course in the voice of whichever bot they're talking to. In the case of InsultiBot this might look something like: "Wow, are you really that thick-headed? Let me make this simple, you can say `insult me`, or `another`, or just ask for help. Or you can say `stop` in order to exit, and finally give me some peace and quiet."
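Wiring the thresholds in is then just a comparison or two at response-building time. Again, a sketch: the -10 cutoff is the one real number from above, while the helper and its return values are mine.

```python
# Sketch of the intervention check run when building each response.
FRUSTRATED_THRESHOLD = -10

def intervention_for(score: int) -> str:
    if score <= FRUSTRATED_THRESHOLD:
        return "frustrated_flow"  # extra, in-character help text
    if score < 0:
        return "handhold"         # prompts and reprompts come back on
    return "none"
```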

Additionally, any "frustrated user" trigger sends me a Slack message, dumping the entire session history for that user so I can see what flow got them to that point. That's a story for another post, however. So far, this system seems to be working well for me, but I'm curious to know what approach other skill builders are taking to this problem. And of course I'd love to know what folks think of my geometric scoring method, and whether there's a better way to qualitatively judge a user's experience.
3 Comments
Jo Jaquinta
5/8/2019 04:25:48 am

Very worthwhile advice. An alternative to your geometric approach might be to take a lesson from Quantum Physics. Instead of a deterministic algorithm, use a probabilistic one. I.e. use the score to determine the % chance that you give the user a prompt, then roll the dice to determine if you do so. That puts a little perturbation into how the skill reacts which can make it less frustrating.
Or maybe I play too much D&D. :-)

Eric
5/18/2019 10:45:29 am

I think the place where that idea actually makes the most sense is on the initial handhold rampdown - it's probably a better facsimile of a "natural" conversational flow than a hardcoded backoff. There are a few places in my code where I do similar things right now - on session exit I sometimes will randomly give them a parting message, and in CompliBot if a user says "thank you" I will look at a few signals that I feed into an RNG to decide whether or not to prompt them for a review.

I think for the "user having a bad time" flow, though, that's a case where I definitely want to chime in every time, once they hit a threshold.



