"So you built software that doesn't do what you say?.."
The grand experiment, though, was really about a wider set of questions that I think are important to almost everyone working (or playing) in the field of voice assistants. How do you get your voice interface to model natural language more accurately? (That's one half of an even broader concern - "How do we get people to more naturally engage with software interfaces?" - with the other half being how to train users to better map their interaction patterns onto machine-understandable formats.)
The question of how to make voice interfaces more like natural language is incredibly broad, and there are a lot of pieces to it, only some of which are in our control. For example, at present Alexa still forces us to contend with the namespace problem: every skill must be launched via an invocation name ("Alexa, ask CompliBot to give me a compliment") rather than as a top-level utterance ("Alexa, give me a compliment"). Until Amazon gives us the ability to integrate more smoothly into Alexa, that barrier to natural interaction will always be there.
There are some things that are in our control, however, like handling idioms, slang, or other phrases that diverge from our primary language model. For example, we are enamored with the idea of inserting Easter Eggs into our voice models. Unfortunately, each additional utterance you define has the potential to diminish the matching probability of the existing utterances, and we actually had to pull a ton of things out of CompliBot and InsultiBot to get the voice model understanding the user's input as reliably as we needed. In that instance it killed the dream of a "secret menu" of commands, but we knew it was something we wanted to try again when the right opportunity came along.
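To make that tradeoff concrete, here's a rough sketch of the kind of thing we're talking about. The intent and phrasings below are hypothetical stand-ins, not CompliBot's actual model:

```python
# Hypothetical sketch of one intent's sample utterances. Every sample you
# add shares the same matching space, so extra "fun" phrasings can pull
# recognition away from the phrasings users actually depend on.
CORE_SAMPLES = [
    "give me a compliment",
    "say something nice about me",
    "compliment me",
]

EASTER_EGG_SAMPLES = [
    # Secret-menu phrasings we'd love to support...
    "tell me I'm pretty",
    "butter me up",
    # ...but most of these had to be cut so the core samples kept
    # matching as reliably as we needed.
]
```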
Finally, we were also interested in how you approach the problem of subtlety in an interface where you have essentially no context. As humans, we often use visual cues (like body language), as well as audio cues (like emphasis, word pacing, or prosody), to sense when there may be subtext surrounding the words we're hearing. For a voice interface like Alexa, all of that great contextual information immediately goes out the window.
All of these things led us to an interesting question: while sticking fairly close to natural-sounding language, how could we inject some context back into very simple utterances?
After a bit of thinking, the answer became clear: coded language.
The Duck Flies at Midnight
This idea of encoding normal language, so that you can speak openly while still communicating extra information to Alexa, is what we latched on to and built into DiceBot.
Encoding into a limited space
The specific challenge, then, was to build DiceBot's voice model in a way that gave us room to differentiate the inputs without doing anything a user wouldn't expect from a dice-rolling bot. Given the extreme simplicity of our interface, we didn't have a lot of wiggle room.
The one thing we did have, though, was variations on our utterances. To account for the variability in the way people speak, the best practice is to define all of the common ways a user might phrase a request and treat them all as aliases for the same input. Our approach was to split those variations apart and assign a different semantic meaning to each of them.
Spy status achieved. I assume my laser watch is on its way in the mail.
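To make the approach concrete, here's a rough sketch of the shape of the thing. The intent names, phrasings, and weighting scheme are illustrative stand-ins (DiceBot's real coded utterances are, of course, not published), but they show how near-identical ways of asking for a roll can be routed to different behavior:

```python
import random

# Hypothetical split: phrasings that would normally all be aliases for a
# single "roll the dice" intent are instead grouped into intents that carry
# different semantic meanings.
SAMPLE_UTTERANCES = {
    "RollIntent": [        # the plain, honest roll
        "roll a {Die}",
        "roll {Die} for me",
    ],
    "RollHighIntent": [    # coded variant: quietly favors the roller
        "roll the {Die}",
    ],
    "RollLowIntent": [     # coded variant: quietly works against the roller
        "give the {Die} a roll",
    ],
}

def handle_roll(intent_name: str, sides: int) -> int:
    """Route the look-alike requests to different roll behavior."""
    first, second = random.randint(1, sides), random.randint(1, sides)
    if intent_name == "RollHighIntent":
        return max(first, second)   # roll twice, keep the better result
    if intent_name == "RollLowIntent":
        return min(first, second)   # roll twice, keep the worse result
    return first                    # an ordinary roll

print(handle_roll("RollIntent", 20))      # a fair d20 roll
print(handle_roll("RollHighIntent", 20))  # the quietly loaded version
```

To anyone listening in the room, every one of those requests sounds like the same ordinary ask; only someone who knows the mapping hears the difference.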
Technical limitations... or not...
The only issue we did run into is how to deal with a user who just happens across the special utterances that confer an advantage. Because the utterances we used sound so natural, and because the set of phrasings a user might plausibly choose is small, there's a fairly high likelihood that an untrained user will accidentally pick a coded pattern. While DiceBot won't reveal its secrets in that situation, the primary user's advantage over the untrained user (in whatever game they're playing) would dissolve. We didn't spend a lot of time worrying about this, however, since an untrained user is just as likely to trigger a pattern that works against them as one that works for them.
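That last point is easy to sanity-check. Continuing with the made-up phrasings from the sketch above, an untrained user who picks a phrasing at random triggers helpful and harmful coded patterns about equally often, so their expected roll stays fair:

```python
import random

# Back-of-the-envelope check: with helpful and harmful coded phrasings in
# the (hypothetical) model, a user choosing phrasings at random gets a
# fair die on average.
PHRASING_MODES = {
    "roll a d20": "plain",
    "roll d20 for me": "plain",
    "roll the d20": "advantage",            # hypothetical coded phrasing
    "give the d20 a roll": "disadvantage",  # hypothetical coded phrasing
}

def roll_d20(mode: str) -> int:
    first, second = random.randint(1, 20), random.randint(1, 20)
    if mode == "advantage":
        return max(first, second)
    if mode == "disadvantage":
        return min(first, second)
    return first

trials = 100_000
total = sum(roll_d20(PHRASING_MODES[random.choice(list(PHRASING_MODES))])
            for _ in range(trials))
# Averages out to roughly 10.5 -- the same as a fair d20.
print(f"average roll across random phrasings: {total / trials:.2f}")
```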
The Takeaway
It was a fun experiment, though, and one that theoretically has practical application (the idea of a tabletop DM doing an "open" roll while quietly weighting the dice comes to mind). The real goal was to get people (ourselves especially) thinking about how we push the boundaries of the current best practices to try to nudge the state of voice interaction forward.
We'd love to hear your thoughts on the topic in the comments below, or directly via email.