One for the (intent) history books
Fortunately for us, we were coming off of a new Alexa feature that dropped last Spring to entirely too little fanfare - the Intent History API. As any Alexa developer knows, you don't get the actual utterance in any request that comes to your skill - you get the much simplified (and therefore lower-fidelity) intent and slot values. As developers, we always knew what users were saying to our skills when things worked, because our model was built from a constrained list of possible utterances. What we could never know before this feature was what our users were saying that was either mapping to the wrong intent, or causing them to fail out of the skill entirely. The release of Intent History flipped that entirely.
Unfortunately for them, Alexa was keying in on the use of the "InsultiBot" name, and mapping the request to "Who is {botName}", a seldom-used meta intent that lets users ask AstroBot/CompliBot/DiceBot/InsultiBot about any of the other bots. You can imagine a user - who is done with the skill and just wants out - getting frustrated when they're repeatedly told about the very skill they're trying to leave.
re:Inventing the handhold
That was sort of an epiphany moment for me, as I realized that I already had (or thought I had...) the mechanism for catching these triggers - I was just using it in reverse for a completely different purpose. I think I went back to my hotel room that night and started fiddling with what would become my eventual solution.
What's the inverse of a graduated back-off?
When CompliBot and InsultiBot first went to cert, we had a month-long argument about whether or not we needed to prompt users after every turn of our dialog. The Alexa design rules were very different at the time, and certification was extremely paranoid about the idea that users wouldn't know what to do if we didn't expressly tell them after every single response. Our user testing didn't show this to be true for skills as straightforward as CompliBot - it was super apparent that you could either make your request again, ask for another, or exit - but cert wasn't budging, so we had to give them something. Begrudgingly, we implemented a graduated back-off of our prompts, which we described as our handhold. After the user had successfully completed a few turns in a given session, we shortened our prompt. A few more turns, and we dropped the prompt altogether, providing only a reprompt if they didn't say anything. And finally, we threw away the reprompt too, the idea being that once a user had heard the prompts several times, they already knew what their options were and didn't need to hear them again.
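In rough TypeScript, that back-off looks something like the sketch below. The turn thresholds, prompt copy, and names here are purely illustrative stand-ins, not the skills' actual code.

```typescript
// Illustrative graduated back-off of the handhold.
// Thresholds and prompt text are examples only.
interface Handhold {
  prompt?: string;   // appended to the spoken response
  reprompt?: string; // spoken only if the user says nothing
}

const FULL_PROMPT = "You can ask for another, or say stop to exit.";
const SHORT_PROMPT = "Want another?";

function buildHandhold(successfulTurns: number): Handhold {
  if (successfulTurns < 3) {
    // New to the session: full prompt and reprompt.
    return { prompt: FULL_PROMPT, reprompt: FULL_PROMPT };
  }
  if (successfulTurns < 6) {
    // Getting comfortable: shortened prompt.
    return { prompt: SHORT_PROMPT, reprompt: FULL_PROMPT };
  }
  if (successfulTurns < 9) {
    // Experienced: drop the prompt, keep a reprompt on silence.
    return { reprompt: SHORT_PROMPT };
  }
  // Fully up to speed: no prompt, no reprompt.
  return {};
}
```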
This seems quaint today, when almost every polished skill is doing something of the sort, but at the time it was really only our skills and those built by Jo Jaquinta and TsaTsaTzu that were doing anything more advanced with prompting. The connection I made after Gal's talk was as follows:
"If we can track a user's session and count the number of good requests they're making to assert that they are well equipped to use the skill without us holding their hand, why can't we also track how many bad requests they're making in order to assert that they actually are in need of immediate help?"
And indeed, at a basic level the concept made total sense. If I could keep track of negative events happening in a session, I could eventually reach a point where I implicitly knew it was time to intervene.
In which I geek out at length on fallback algorithms...
So what I did was define an arbitrary session variable named "score". This score would start at 0, and count up every time a user hit a "good" intent (which is to say not "Help", "Fallback", or a couple of meta intents). There were a few things they could do to keep the score neutral - asking for a repeat doesn't change it, unless what's being repeated is the Help text. And of course, asking for Help or hitting the Fallback decreases their score.
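Sketched out in TypeScript, the scoring update looks roughly like this. The intent names are stand-ins rather than the skills' real interaction model:

```typescript
// Illustrative per-turn score update; intent names are examples only.
const GOOD_INTENTS = new Set(["InsultMeIntent", "AnotherIntent"]);
const BAD_INTENTS = new Set(["AMAZON.HelpIntent", "AMAZON.FallbackIntent"]);

function updateScore(score: number, intent: string, lastIntent?: string): number {
  if (intent === "AMAZON.RepeatIntent") {
    // Repeats are neutral, unless the user is repeating Help.
    return lastIntent === "AMAZON.HelpIntent" ? score - 1 : score;
  }
  if (BAD_INTENTS.has(intent)) {
    return score - 1;
  }
  if (GOOD_INTENTS.has(intent)) {
    return score + 1;
  }
  // Meta intents and anything else leave the score unchanged.
  return score;
}
```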
Originally, this was an entirely linear algorithm: one point up for good, one point down for bad. The problem, then, was deciding where to start inserting the handhold again. It didn't make sense to set the threshold at any positive score, because with the counter starting at zero, every new user would immediately fall into that bucket. At the same time, a user who triggered a few good intents and got their score up to 5 would then have to trigger 6 bad intents in a row before being given any help. Obviously, most users are long gone by that point.
So I iterated on the algorithm and decided that instead we would consider consecutive "bad" triggers geometrically. A user who had one good intent, one bad intent, and one good intent, in sequence, would still end up with a net score of +1. But for any user triggering multiple bad intents in a row, the amount subtracted compounds with each consecutive miss rather than dropping by a flat point each time - something along the lines of the sketch below.
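One way to read "geometrically" - and to be clear, the doubling factor here is an illustrative assumption, not the skills' exact curve - is a penalty that doubles with each consecutive bad intent, while a single good intent resets the streak:

```typescript
// Illustrative geometric penalty: -1, -2, -4, -8, ... for consecutive bad intents.
// The doubling factor and state shape are assumptions for the sake of the sketch.
interface ScoreState {
  score: number;
  badStreak: number; // consecutive bad intents so far
}

function applyBadIntent(state: ScoreState): ScoreState {
  const penalty = Math.pow(2, state.badStreak); // 1, 2, 4, 8, ...
  return { score: state.score - penalty, badStreak: state.badStreak + 1 };
}

function applyGoodIntent(state: ScoreState): ScoreState {
  // A good intent adds one point and resets the streak,
  // so good/bad/good still nets out to +1.
  return { score: state.score + 1, badStreak: 0 };
}
```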
Once the scoring mechanism was in, dealing with the issue was quite simple. Any user whose score dips into the negatives immediately gets prompts and reprompts turned back on. Taking it further, any user who manages a score of -10 or lower triggers my "frustrated user" flow, where I actively give them some extra help text, written of course in the voice of whichever bot they're talking to. In the case of InsultiBot this might look something like: "Wow, are you really that thick-headed? Let me make this simple, you can say `insult me`, or `another`, or just ask for help. Or you can say `stop` in order to exit, and finally give me some peace and quiet."
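Tying it together, the intervention check is just a pair of thresholds on the running score - something like this sketch, with the flow names being stand-ins:

```typescript
// Thresholds from the post: below 0 restores the handhold,
// -10 or lower triggers the frustrated-user flow.
type Intervention = "none" | "handhold" | "frustrated";

function chooseIntervention(score: number): Intervention {
  if (score <= -10) {
    return "frustrated"; // extra help text, in the bot's own voice
  }
  if (score < 0) {
    return "handhold"; // prompts and reprompts turned back on
  }
  return "none";
}
```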
Additionally, any user who hits the frustrated-user trigger also sends me a Slack message, dumping the entire session history for that user so I can see what flow got them to that point. That's a story for another post, however. So far, this system seems to be working well for me, but I'm curious to know what approach other skill builders are taking to this problem. And of course I'd love to know what folks think of my geometric scoring method, and whether there's a better way to qualitatively judge a user's experience.