3PO-Labs: Alexa, Echo and Voice Interface

A Treatise on Testability

3/31/2016

Watching the rapid growth - and associated growing pains - of the Alexa Skills Kit over the last six or seven months, we've noticed a recurring theme lurking in the shadows behind a lot of the issues faced by the development community: many of the problems we run into could be greatly mitigated by good testing, but good testing is nigh impossible at the moment. Fortunately, it doesn't have to be this way - the path out lies after the break.


TL;DR: Testability isn't the only shortcoming right now, but solving it may provide the best return on investment.  Alexa is both a black box and a moving target.  The only way we can begin to wrangle it is with a concerted effort to add the necessary hooks to do full-stack testing.  Until then, skills will remain flaky and devs will keep feeling angry about how helpless they are.
Alright, if you made it past the TL;DR that means you're probably in this for the long haul.  Grab something to snack on and get comfy, because there's a lot to work through here.  Let's start by defining a few things:

Glossary

  • Testability: Testability is the degree to which a product can be reasonably tested.  Within the scope of this article, we specifically care about the testability of how our skills are integrated with Alexa. There are myriad different aspects of testability that you might consider, and this is a topic that is consistently talked about by some of the brightest minds (James Bach, Cem Kaner, Lisa Crispin, et al) in software quality.  Our approach today will be a simplistic one - to say that more testability is better, and to propose specific actions to improve it.
  • Black Box: When we say that a system is a black box, we mean that its inner workings are not visible to us as consumers.
  • Synthetic Monitoring: Synthetic monitoring is the idea of observing (on some recurring basis) the state of a system by pretending to be a normal user and interacting with that system.  This is distinct from a standard monitoring approach, which uses custom-built metrics or utilities to convey system state.
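To make that last definition concrete, here is a minimal sketch of what synthetic monitoring could look like for a skill backend. Everything in it is hypothetical - the endpoint URL, the canned request, the intent name - and it assumes a runtime with a global fetch (a recent Node version or similar); it's an illustration of the idea, not an official ASK utility.

```typescript
// A toy synthetic monitor: every few minutes, replay a canned Alexa-style
// request against the skill's HTTPS endpoint and complain if the response
// stops looking healthy. Endpoint, request body, and intent are placeholders.
const SKILL_ENDPOINT = "https://example.com/my-skill";
const CHECK_INTERVAL_MS = 5 * 60 * 1000; // every five minutes

const cannedRequest = {
  version: "1.0",
  session: { new: true, sessionId: "synthetic-session" },
  request: { type: "IntentRequest", intent: { name: "GetComplimentIntent" } },
};

async function runCheck(): Promise<void> {
  try {
    const res = await fetch(SKILL_ENDPOINT, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(cannedRequest),
    });
    const body = await res.json();
    const speech: string = body?.response?.outputSpeech?.text ?? "";
    if (!res.ok || speech.length === 0) {
      console.error("Synthetic check FAILED:", res.status, JSON.stringify(body));
      // Hook in whatever alerting you like here (email, SNS, a pager, ...).
    } else {
      console.log("Synthetic check passed:", speech);
    }
  } catch (err) {
    console.error("Synthetic check errored:", err);
  }
}

setInterval(runCheck, CHECK_INTERVAL_MS);
runCheck();
```

Note that this only exercises the half of the stack we own - the request/response handling. The voice model sitting in front of it is exactly the part we can't reach, which is the gap the rest of this post is about.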

The Problem Space

Usually this is the point in a blog post where I'd lay out a nice concrete issue, and start talking about who it's hurting and why it's important.  Unfortunately, what we're dealing with is much more vague, and can't be so easily summarized.
At its most abstract, the problem is that "Alexa is harder to build for than it should be". Actually, you know what?  Scratch that, let's rephrase.
"It's really easy to build Alexa skills, but it's really hard to build Alexa skills well."
That does a good job of encompassing the problem broadly, but it also succeeds in being suitably generic and subjective enough to be meaningless.  To add some substance to the claim, we'll start with a sort of "proof by induction" approach, aggregating various anecdotes, and then extracting patterns or commonalities from them.  Each individual example may not hold much weight, but when the data points are combined, we can gain a lot of knowledge.  "E Pluribus Unum" or something, right?

Example 1: Something broke, my users say

With Alexa Skills Kit, we are dependent on Amazon's platform for our products to work.  It's great that they've built something that does a lot of the hard work for us, and the tradeoff that we accept is that we are completely vulnerable to the whims of their workflows.  When the winds of change come rolling in, they wash straight over production without warning, and we either roll with the waves or wipe out.
There's a commonly repeated joke in the software development world - "Why do I need testers, when my users will test for me?" On multiple occasions, skill developers have found out that an update on Amazon's side resulted in their skills no longer working.  They often found this out via an influx of bad ratings for their skill.

A few examples:
  • That time that redirect_uri for OAuth was changed
  • Last week, when a new field was added silently (admittedly, we missed something on our side, but still...)
  • Skill outages, like the ones reported on by TsaTsaTzu on March 22, or the more widespread one that is occurring right now, March 30, as I write this.

Now, as anyone who has worked in software knows, these things happen. I'm not here to pass judgement on their uptime or the quality of what they release.  Just to paint a picture of the current state.
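On the "new field was added silently" point specifically, one small mitigation on our side is to parse incoming requests leniently: pull out only the fields the skill actually needs and ignore anything extra, rather than running strict validation that treats an unknown field as an error. A rough sketch - the type below is our own minimal view, not an official SDK type:

```typescript
// Tolerant extraction: read only what this skill needs from the raw request
// body and ignore anything Amazon may add later, instead of failing on it.
interface MinimalRequestView {
  request?: { type?: string; intent?: { name?: string } };
}

export function extractIntentName(raw: unknown): string | undefined {
  const body = raw as MinimalRequestView;
  if (body?.request?.type !== "IntentRequest") return undefined;
  return body?.request?.intent?.name;
}
```

This doesn't help with outright outages, of course - only the change-shaped breakages - which is why the monitoring story still matters.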


Example 2: The bait-n-switch

So now let's talk about a different case - the case where something changes but doesn't cause things to fail.  This seems to happen a lot in relation to the voice model and the way things are parsed into intents.  Occasionally, someone will notice that a phrase that was operating in one fashion one day starts operating differently the next.
The thing to understand, though, is that this is absolutely a good thing!  One of the most glorious aspects of Alexa is that she's always learning and getting better at matching - almost everything that comes in is used as a data point to train the system. 
That said, sometimes people can have a hard time with change, especially change that they weren't expecting.  Recently, the Echo user base seems to have lost its collective mind over something as simple as a change in the acknowledgement Alexa gives to a request, from a beep to "OK".
As developers, we have to be doubly aware of the ever-changing nature of the system we're integrated with. As a result, building resiliency into our systems is that much more important.  A good fallback is pivotal in being able to evolve your skills with the system.
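To illustrate (with hypothetical intent names, and not any official SDK API), a fallback can be as simple as a catch-all branch in the skill's request router, so that an utterance which suddenly resolves to an unexpected intent gets a helpful reprompt instead of an error:

```typescript
// A minimal intent router with an explicit catch-all, so unexpected intent
// resolutions degrade into a reprompt rather than a failure.
interface SpeechResponse {
  outputSpeech: { type: "PlainText"; text: string };
  shouldEndSession: boolean;
}

const speak = (text: string, endSession: boolean): SpeechResponse => ({
  outputSpeech: { type: "PlainText", text },
  shouldEndSession: endSession,
});

const handlers: Record<string, () => SpeechResponse> = {
  GetComplimentIntent: () => speak("You have excellent taste in voice platforms.", true),
  "AMAZON.HelpIntent": () => speak("You can ask me for a compliment.", false),
};

export function route(intentName: string): SpeechResponse {
  const handler = handlers[intentName];
  if (handler) return handler();
  // The fallback: acknowledge the miss and steer the user back on course.
  return speak("Hmm, I'm not sure about that one. Try asking me for a compliment.", false);
}
```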
Beyond that, though, we also have to expect that what was passable at one point in time may not always be that way.  The phrase "moving goalposts" comes up a lot - especially when talking about certification, but also in terms of the voice model. There's a veritable mountain of cases where developers have noticed changes without touching anything on their side.
In fact, our team actually documented one such occurrence as we were preparing to submit CompliBot and InsultiBot for certification.  We noticed that between test passes of our code (where we had changed nothing), the range of values in Amazon's built-in list of US first names had ballooned. The result was that words used in many of our test cases began incorrectly matching less-common names, thus failing those tests.  In the long run, the change was a good one - the list was more comprehensive - and we were eventually able to work around the problem with other tweaks, but the fact is that when the change came through it caught us wholly off guard, and invalidated a lot of test data that we had painstakingly gathered manually.
Which leads into our final point...

Example 3: We're super "Agile" as far as "Waterfall" goes

The process of building a skill right now can be slow and painstaking.  As much as the "Build a Skill in Under an Hour" thing is being pushed, development of bigger, more complex skills is a lot more involved.
A lot has been written about certification (by us, and by others), and there's no need to rehash that right now, but it's important to call out that the process is necessarily iterative.  Code is submitted to a reviewer, and then when it reaches the front of that reviewer's queue, feedback and critiques are given (some mandatory, some optional).  A developer should expect at least a couple rounds of back-and-forth.
While this review process is iterative, the Agile sentiment of fast iteration and fast failure does not lend itself well to working with Alexa right now.  Changing the sample utterances, custom slots, invocation name, or intent schema is a manual UI process, which forces a rebuild of your skill's voice model.  Each time, this should then be followed by whatever degree of testing your team deems necessary.
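For context, the artifacts being pasted into that UI by hand are roughly an intent schema plus a list of sample utterances - something along these lines (a toy example, with hypothetical intent names, shown as TypeScript literals purely for illustration):

```typescript
// Toy interaction model artifacts, approximating what currently gets pasted
// into the developer portal by hand whenever the voice model changes.
export const intentSchema = {
  intents: [
    { intent: "GetComplimentIntent", slots: [{ name: "FirstName", type: "AMAZON.US_FIRST_NAME" }] },
    { intent: "AMAZON.HelpIntent" },
  ],
};

// Sample utterances: one per line, each mapping a spoken phrase to an intent.
export const sampleUtterances = [
  "GetComplimentIntent compliment {FirstName}",
  "GetComplimentIntent say something nice about {FirstName}",
].join("\n");
```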
For a team concerned with quality, this whole cycle can be incredibly onerous, since at present it is assumed that most teams are either building physical devices to do automated testing, or (more likely) just testing manually.
(Image: a makeshift physical test harness. GET IT? It's "test harness" as a double entendre!...)

We get it, being a skill developer sucks...

Alright, so at this point it probably seems like I'm piling on, bringing up reason after reason that it's hard to be an Alexa developer.  BUT!... That's not the point of this at all.  To be perfectly honest, Alexa is one of the cleanest systems I've ever had the opportunity to work with.
The reason I bring all of these things up is to show their commonality.  While it may seem like all of the issues are distinct (what do production breaks have to do with slow iterative cycles or the fluidity of an ever-learning voice model?), what they all share is a potential mitigation path.
I assert here that if developers had access to stronger test utilities for the Alexa platform, the severity of each of these shortcomings (and many more!) would be considerably less.¹
Consider:
  • A platform where we could do end-to-end testing in a way that was recurring - where a synthetic transaction would occur every five or ten minutes to confirm that everything was still hunky-dory.  No more finding out from users that your skill is now mishearing the name for the rapper "Coolio" as "Culo" - the Spanish word for "ass".
  • An API that allows a user to analyze the way some voice clips are being resolved into intents, to help figure out why their user logs show them getting into a weird state where they just keep asking for help over and over.  (This is a real, mysterious recurring problem for our team.)
  • A system that allows a dev team to make a minor tweak based on certification feedback, and then kick off an entire suite of automated voice tests to confirm that the certification suggestion didn't degrade performance of their skill's voice model.
  • To build on the previous point, imagine a world where test automation for Alexa was so straightforward that the dev community could work together to build cases for people with different accents, genders, etc. Instead of one dev repeatedly testing a skill manually with their own voice, we could build out a sharing economy where a quid pro quo could get you audio files of your test cases read by someone else, in exchange for you reading their utterances.  (A sketch of what such a shared suite might look like follows this list.)
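To make that last idea a bit more tangible, here's a sketch of what a shared voice test suite could look like. Everything in it is hypothetical - the manifest format, the contributed clips, and especially the resolveAudio call, which stands in for the kind of resolution API we propose in the next section:

```typescript
// Hypothetical shared voice test suite: audio clips contributed by different
// speakers, each paired with the intent the skill's model should resolve to.
interface VoiceTestCase {
  audioFile: string;       // path to a contributed recording of an utterance
  speaker: string;         // who recorded it, for accent/gender coverage
  expectedIntent: string;  // the intent the voice model should produce
}

const suite: VoiceTestCase[] = [
  { audioFile: "clips/compliment-us-female.wav", speaker: "contributor-a", expectedIntent: "GetComplimentIntent" },
  { audioFile: "clips/compliment-uk-male.wav",   speaker: "contributor-b", expectedIntent: "GetComplimentIntent" },
  { audioFile: "clips/help-indian-male.wav",     speaker: "contributor-c", expectedIntent: "AMAZON.HelpIntent" },
];

// Placeholder for the resolution API proposed below: submit a clip against the
// skill's live voice model and report what it resolved to, without actually
// invoking the skill's Lambda or web service.
declare function resolveAudio(audioFile: string): Promise<{ intentName: string }>;

export async function runSuite(): Promise<void> {
  for (const testCase of suite) {
    const result = await resolveAudio(testCase.audioFile);
    const passed = result.intentName === testCase.expectedIntent;
    console.log(
      `${passed ? "PASS" : "FAIL"} ${testCase.audioFile} (${testCase.speaker}): ` +
      `expected ${testCase.expectedIntent}, got ${result.intentName}`
    );
  }
}
```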
Beyond all that, there's the market argument.  At some point, in order for Alexa to be considered a full-fledged platform by mature software shops, it's going to need to support the standard CI/CD workflows that have become commonplace. That doesn't happen without test automation.
With that in mind, it seems like enhancing the test automation capabilities provided to skill developers would have enough of a payoff to warrant the necessary work on Amazon's side.

Moar solutions plz...

What we propose, then, would look something like this:
  1. A new API for developers to use
  2. Accepts an audio file
  3. Uses the skill's live model, just as a user's request would ²
  4. Returns to the user (as JSON) a payload describing what the request resolved to, and what the theoretical skill request would've looked like
  5. Does NOT actually call out to the skill's Lambda or web service
  6. Can be secured by product ID, Login with Amazon, or any other token system
  7. Can be rate limited to avoid abuse
This would cover many of the use cases described above, especially when used in conjunction with existing test utilities (like the Alexa Service Simulator, or ASK Responder). It's not a perfect solution, but it would definitely put us in a better place.
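To pin that down a little further, here's a rough guess at what the request and response for such an API could look like. None of this exists today; every field name and shape below is purely illustrative:

```typescript
// Illustrative shapes only: what the proposed resolution API might accept and
// return. It would run an audio clip through the skill's live voice model and
// report what *would* have been sent to the skill, without calling the skill.
export interface ResolutionRequest {
  skillId: string;       // which skill's voice model to resolve against
  audioBase64: string;   // the recorded utterance, base64-encoded
  accessToken: string;   // e.g. a Login with Amazon token, to prevent abuse
}

export interface ResolutionResponse {
  transcription: string;                       // what Alexa heard
  resolvedIntent: {
    name: string;                              // e.g. "GetComplimentIntent"
    slots: Record<string, { value: string }>;  // e.g. { FirstName: { value: "Janet" } }
  };
  theoreticalSkillRequest: object;             // the payload the skill would have received
}
```

Paired with rate limiting and the existing simulators, that would be enough to build the monitoring, debugging, and regression suites described above.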

What do you guys think?  Let us know in the comments below, in the communities where we link this post, or feel free to contact us directly.
-DERP

Extra Notes:

 
¹ We do want to recognize that the Alexa team is not blind to the testability needs - we've received two really handy tools in the last few months in the Service Simulator and the Voice Simulator.  While those are awesome, they leave a big gap in terms of testing the front-door of our skills, with the voice model still being a black box.
 
² The existing test utilities have unfortunately tended to suffer from bit rot, with those tools not maintaining feature parity with the live service when new features roll out (which is when the utilities are needed the most).  That's why we propose running this tool through the live model.
4 Comments
Jeremy
5/30/2016 12:08:13 pm

I totally agree with you guys! One of my biggest issues has been that even though my code works perfectly fine, my SSML will not be correct, or that some words Alexa is amazing at picking up while others turn out to be just gibberish to Alexa. I spent hours trying out different invocation names to see which ones worked the best. I started putting together a code example on how to unit test an Alexa app, but that still does not solve the "moving goalposts" problem. https://github.com/jjbskir/alexa-skill-unit-tests-js

Eric
6/1/2016 09:56:46 pm

Your unit testing example is good stuff. It's frustrating that so little focus has been given to testing, although it makes sense given the push for "1 hour skills" (or even more recently "5 minute skills"). The people taking this approach are not the same people who are looking to write what I'd describe as traditionally "good" software.

When we originally posted this article, our intent was to follow up with some ideas about how to implement things in our own way. The reason we haven't done that is that we continually get the sense that Amazon is super close to announcing something that will make our approach obsolete. Unfortunately, it's now been months and they have yet to announce anything, and we're still not being given insight into whether anything is coming down the pipeline. That lack of knowledge is paralyzing, though, because we don't want to spend a bunch of time building something that Amazon is going to make a better version of.

resume planet
1/17/2020 03:12:31 am

It is important that something is testable, especially in our lab. We use all sorts of things in our research, and believe me, we use EVERYTHING. I think that if we cannot test something, then that will not make things easy. As researchers, we need to be able to test and learn about everything we come across. I want to go out and learn about the world, and this will really help me come a step closer to it, so I hope that it can be tested.

Erica
12/12/2020 11:51:59 pm

Nice blog you have, thanks for posting.

