But even then, with a much more limited platform and a (relatively) tiny developer community, one thing was clear: the paradigm shift involved in developing an Alexa skill brought with it testing challenges that weren't exactly straightforward. And further, it was clear that the dev tools were rapidly outpacing the test tools.
So where do things stand today as far as testability goes? That's a hard question to answer, as it's so multi-faceted. Two things are apparent: development has continued at a breakneck pace, and the community today is much more interested in testing than it was back then. As for everything else, you'll have to judge for yourself.
A Plethora of Positives
Testing pays the bills:
The first (and most exciting for me) is that the community has taken the challenge to heart. We've seen innumerable frameworks, tools, and utilities pop up for the purpose of testing Alexa. More importantly, we've seen a few businesses get going whose core competency is testing for voice assistants. I want to take a moment to call out two of them.
It's our strongly held belief that the emergence of test-first companies is an undeniable sign of maturity for the platform - essentially that companies are now at the point where they need a long-term "Alexa strategy" which includes iterative updates and maintenance (and therefore test automation), rather than just throwing things at a wall to see what sticks.
It's also worth noting the advances Amazon has made on this front. The original article was written just as the v1 iterations of Amazon's first Alexa test tools - the skill simulator and voice simulator - had come out. And that was the state of things for quite a while. There were occasional updates to the skill simulator (although it often did not maintain feature parity with what you could do via an actual device), but for the most part it didn't change for a year and a half. Late last year, though, we saw the first of several major updates to the developer console, in the form of what is now a set of tools known collectively as the "test simulator". Included are a panel for testing display directives, a request/response history with more or less complete session support, and most importantly an AVS implementation that allows you to hold a conversation with your skill from the web, much like a third-party test tool from the last few years, Sam Machin's "Alexa in the Browser" / EchoSim.io.
Check out what Amazon is offering now:
All of these cool new test toys seem to be the direct result of Amazon forming a "Skill Quality" team within the Alexa org last year. It's something that was sorely lacking early on, and for those of us who are serious about the platform, that team's workstream is certainly paying dividends.
A Shifting Landscape
Take, for example, skill localization. When we wrote our original treatise, Alexa (officially) existed for a single locale - US English. Today, Alexa is available in German and Japanese, as well as Australian, Indian, Canadian, and British English (with Ireland reportedly up next). The result is that the testing burden for any given skill is multiplied several-fold, even if it's only released in English. As I'll explain later, there are reasons this is even worse than it sounds.
Beyond straightforward concerns like localization, Amazon continues to release new features which muddy the testing waters a bit. For example, where does testing happen when the paradigm for the thing you're testing has shifted so that it's no longer a simple request-response flow? This is something we're seeing today with new features like Skill Events, Progressive Responses, and most importantly Push Notifications (in fact, we touched on the testing aspect of this in a post a few months ago).
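To make that concrete, here's about the best we can do today for something like a Progressive Response - a hedged sketch, in Python, assuming a hypothetical helper that POSTs the VoicePlayer.Speak directive. We can verify that our code asks Alexa to speak; we can't verify what the user actually experiences.

```python
# A minimal, hypothetical sketch: unit-testing "our side" of a Progressive
# Response by stubbing out the HTTP call. The helper below is a placeholder;
# the /v1/directives path and VoicePlayer.Speak directive shape come from the
# Progressive Response API, but in a real skill the base URL should be taken
# from context.System.apiEndpoint in the incoming request.
from unittest.mock import patch

import requests

DIRECTIVES_URL = "https://api.amazonalexa.com/v1/directives"


def send_progressive_response(api_access_token, request_id, speech):
    """Hypothetical helper: enqueue interim speech while the skill works."""
    payload = {
        "header": {"requestId": request_id},
        "directive": {"type": "VoicePlayer.Speak", "speech": speech},
    }
    return requests.post(
        DIRECTIVES_URL,
        json=payload,
        headers={"Authorization": f"Bearer {api_access_token}"},
        timeout=5,
    )


def test_progressive_response_directive_is_sent():
    # We can prove our code made the call with the right shape...
    with patch("requests.post") as mock_post:
        send_progressive_response("token-123", "req-456", "Working on it...")
        assert mock_post.called
        _, kwargs = mock_post.call_args
        assert kwargs["json"]["directive"]["type"] == "VoicePlayer.Speak"
    # ...but nothing here tells us whether the user ever heard that speech,
    # or heard it at the right moment. That half of the flow is a black box.
```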
And taking this notion of new flows even further, what happens when you're working with entirely new modes of interaction? As noted above, the test console now includes a feature for testing display directives, but that specifically matches the form factor of the Echo Show. What about the brand-new Echo Spot? The circular screen of the Spot has serious implications for how we use display templates! And most recently, the release of Echo Buttons and the Alexa Gadgets API means we have a completely novel input-output mechanism to account for in our skills.
So while the skill quality team is pumping out helpful updates, it's probably safe to assume they're not gonna be lacking work anytime soon.
The Solvable-But-Unsolved
Consider a case where a developer has built a skill and needs assurance that it will work across a variety of voices - crossing boundaries of age, gender, and regional accent. They could certainly spend the time and money to manually test the skill with a broad spectrum of users, and that would likely be a good upfront investment. But what happens the next time the developer wants to iterate on the skill? This idea of continual improvement and iteration of skills is something Amazon has been talking about of late. Repeating that sort of heavyweight manual testing is certainly not sustainable as part of a continuous delivery pipeline.
Even happy-path testing for a multi-lingual (or multi-locale English) skill becomes a huge pain, as you're essentially having to repeat the same flows over and over. As the Alexa tidal wave spreads into a global phenomenon, manual testing is going to become more and more of a losing proposition. As a tester by trade, and someone who has been more concerned with skill quality than 99% of other Alexa developers, even I have no interest in that sort of repetitive, hands-on timesink.
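For the half of the flow we do control - JSON request in, JSON response out - parameterizing a happy-path test across locales is straightforward. Here's a rough sketch using pytest; the module, handler, and intent name are all placeholders for whatever your skill actually exposes.

```python
# A rough sketch of running the same happy-path flow across every locale,
# assuming a hypothetical lambda_handler exported from my_skill. The request
# envelope is trimmed to the fields a typical handler cares about.
import pytest

from my_skill import lambda_handler  # placeholder module and handler

LOCALES = ["en-US", "en-GB", "en-AU", "en-IN", "en-CA", "de-DE", "ja-JP"]


def build_intent_request(locale, intent_name):
    """Build a minimal IntentRequest envelope for the given locale."""
    return {
        "version": "1.0",
        "session": {
            "new": True,
            "sessionId": "test-session",
            "application": {"applicationId": "test-app-id"},
        },
        "request": {
            "type": "IntentRequest",
            "requestId": "test-request",
            "locale": locale,
            "intent": {"name": intent_name, "slots": {}},
        },
    }


@pytest.mark.parametrize("locale", LOCALES)
def test_happy_path_in_every_locale(locale):
    event = build_intent_request(locale, "GetComplimentIntent")  # placeholder intent
    response = lambda_handler(event, None)
    # The speech text differs per locale, but the response shape shouldn't.
    assert "outputSpeech" in response["response"]
```

Notice what this doesn't cover, though: nothing in it ever touches the voice side, so every locale's actual utterances still need a human talking at a device.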
Both of the previous points have a lot to do with how you get a skill out the door, but there's also the problem of what happens once your skill is live. Sure, it's easy to make sure your Lambda or REST service doesn't change - you control that. But what about Alexa? You certainly don't "control" Alexa, and it's pretty easy to argue that nobody really does. There are hundreds (maybe thousands?) of engineers, all of them smarter than me, working daily on changing how the end-to-end flow works. And for the most part, they're making things better. But machine learning, natural language understanding, and the like are inexact sciences. A change that makes 95% of skills resolve their intents as-good-or-better is still making 5% of them resolve worse. How do you know if your skill is in that 5%?
This may sound like an abstract problem, but it happens! When our first two skills (CompliBot and InsultiBot) were in their initial certification phase, we actually saw this happen live. We were lucky enough to be doing a series of manual test passes over a list of 40-some utterances that we wanted to work at launch, many of which included the AMAZON.US_FIRST_NAME slot. While we were working through the tests, all of a sudden we started getting incorrect responses due to the slot mapping to obscure (but phonetically similar) names. Instead of "give me an insult" we were getting mappings along the lines of "give Maeve an insult". As it happens, Maeve IS an English name... one that fell almost completely out of usage in the late 1800s. Right before our eyes, Amazon had expanded that slot's source data from the top 1,000 US names to the top 10,000 or so, thereby greatly increasing the chance of collisions and false-positive matches. Great for accommodating people with less common names, but it completely messed up the way our skill mapped one really important word.
(As an aside, the Alexa team at the time - late 2015 - was super weird about the whole thing. Despite the fact that we had multiple sets of clear baselines from having run through the test suite repeatedly and recorded our results, they refused to acknowledge that anything had changed. It's one thing to be secretive about your products, but it's another thing entirely to just lie for no gain. It was early days for the platform, and they were still figuring out how to interact with the community, but it was strange nonetheless.)
The platform continues to change and evolve, and presumably changes to intent mapping today go through A/B tests and produce tons of metrics for Amazon's internal folks before being applied broadly. But that doesn't mean that changes with a negative impact don't make it out the door. Over the first couple of months of this year, we've seen a flood of reports from some of the most well-respected and skilled developers describing issues with how one-shot invocations work. And many of these folks are only finding out about it because they are beginning to get negative reviews.
And it's all because there's no way to know how Alexa is going to massage and remold your intent model when the underlying architecture and training data change.
Just the price you pay, or can we fix it?
As far as we can see, there are three approaches:
- Amazon can just never stop iterating on Alexa. This is a dumb idea, I feel bad for writing it, and I certainly hope you aren't sitting there nodding your head in agreement right now. But technically speaking, it would solve the problem.
- Provide opt-in developer notifications when things change. We actually pushed for this early on - it made sense when Alexa was a speculative technology with a small development team pushing out modest updates occasionally. Today, the Alexa org is a behemoth that is becoming a new pillar of Amazon's business, and as a developer I'm certainly not interested in having my inbox fill up completely with "New Alexa change, retest your skills" emails.
- ------------> Give us a way to test through the front door. <---------------
Well, when you lay it out like that, it seems pretty obvious. But what do we mean by "test through the front door"?
Doors (comma) Front
Amazon builds a REST service. The input to that service is some audio. The output of that service is the JSON that would've been sent to our skill for that audio input.
That's it. That's all we need, and we can do continuous testing. We can do full regression of our voice model for tiny little changes. We can record test snippets for different types of voices and make sure our skills are inclusive in how they resolve. We can invest in porting our skills to other languages, knowing we don't need to go bug our one German-speaking friend every time we make an update.
It doesn't need to do anything more than that, because as mentioned, Amazon and third parties have already covered testing the rest of the flow. This is the final missing cog, and once we plug it in, the gears of quality can start cranking.
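For the sake of argument, here's roughly what a test against that service might look like. Everything below is imagined - the URL, the auth, the response shape - since the service doesn't exist yet; the point is just how small the ask is.

```python
# Purely speculative sketch of a test against the proposed "front door"
# service: send recorded audio, get back the request JSON the skill would
# have received. The endpoint, parameters, and auth are all imaginary.
import requests

FRONT_DOOR_URL = "https://api.amazonalexa.com/v1/skills/{skill_id}/nlu/audio"  # imaginary


def resolve_audio(skill_id, audio_path, locale, token):
    """Send a recorded utterance, return the JSON our skill would receive."""
    with open(audio_path, "rb") as audio:
        resp = requests.post(
            FRONT_DOOR_URL.format(skill_id=skill_id),
            headers={"Authorization": f"Bearer {token}", "Accept-Language": locale},
            data=audio,
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()


def test_insult_utterance_resolves_cleanly():
    request_json = resolve_audio(
        "amzn1.ask.skill.example",           # placeholder skill ID
        "recordings/give-me-an-insult.wav",  # one clip per voice/accent on file
        "en-US",
        "test-token",
    )
    intent = request_json["request"]["intent"]
    assert intent["name"] == "GetInsultIntent"          # placeholder intent
    assert "Maeve" not in str(intent.get("slots", {}))  # no phantom first names
```

Record a handful of clips per voice, per accent, per locale, and suddenly "did Alexa start hearing Maeve again?" is a line in a regression suite instead of a one-star review.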
What do you say, Amazon? Let's get this done.
Agree that this is a good idea? We have a UserVoice request up now. Disagree? We'd love to hear a solid counter-argument - hit us up in the comments or directly via email/Twitter.