
A Treatise on Testability, Redux

2/19/2018

Almost two years ago we sat down to put together all of our thoughts about testing and testability for the fledgling Alexa platform. In light of recent events causing us to link out to that article a few times, we decided it may be time to do a bit of a retrospective on the topic and present our view of where things are today.


It's a little strange to consider how nascent Alexa was when we wrote that original article. At the time, they were still printing out "first 100 skills" shirts for developers, two orders of magnitude less than where we are today. The platform existed in one country, and on only a single device (the gen 1 Echo). Amazon hadn't yet "won" two consecutive Christmas seasons with its devices, nor infiltrated every technology trade show with AVS implementations.

But even then, with a much more limited platform and a (relatively) tiny developer community, one thing that was clear was that the paradigm shift involved with developing an Alexa skill brought with it testing challenges that weren't exactly straightforward. And further, it was clear that their dev tools were rapidly outpacing their test tools.

So where do things stand today as far as testability? That's a hard question to answer, as it's so multi-faceted. Two things are apparent: Development has continued at a breakneck pace; and, the community today is much more interested in testing than they were at the time. As for everything else, you'll have to judge for yourself.

A Plethora of Positives

I figured we could start with some of the major changes that we've seen in the last couple years.

Testing pays the bills:
The first (and most exciting for me) is that the community has taken the challenge to heart. We've seen innumerable frameworks/tools/utilities popping up for the purpose of testing Alexa. More importantly, we've seen a few businesses get going whose core competency is testing for voice assistants, which is an undeniable sign of maturity for the platform. I want to take a moment to call out two of them.
One of the first well-supported toolsets out of the gate was Bespoken tools, which originally provided a sort of proxy setup that let you test a skill through the Alexa interface while developing it on your machine. Around that core they've built a ton of other functionality allowing automated testing for a good chunk of the skill lifecycle, while simultaneously building Bespoken into one of the most respected companies in the space.
On the other side of things is Pulse Labs, who are working on tackling how the human side of testing happens for voice assistants. One of the teams in the inaugural class of the Techstars Alexa accelerator, Pulse Labs is defining what UAT looks like in this paradigm.

It's our strongly held belief that the emergence of test-first companies is a sign of maturity for the platform - essentially that companies are now at the point where they need a long term "Alexa strategy" which includes iterative updates and maintenance (and therefore test automation), rather than just throwing things at a wall to see what sticks.
Amazon beefs up their offering:
It's also worth noting the advances Amazon has made on this front. The original article was written just as the v1 iterations of Amazon's first Alexa test tools - the skill simulator and voice simulator - had come out. And that was the state of things for quite a while. There were occasional updates to the skill simulator (although often it did not maintain feature parity with what you could do via an actual device), but for the most part it didn't change for a year and a half. Late last year, though, we saw the first of several major updates to the developer console, in the form of what is now a set of tools known collectively as the "test simulator". Included are a panel for testing display directives, a request/response history with more or less complete session support, and most importantly an AVS implementation that allows you to hold a conversation with your skill from the web, much like another third-party test tool from the last few years, Sam Machin's "Alexa in the Browser" / EchoSim.io.

Check out what Amazon is offering now:
[Screenshot: the test simulator in the new Alexa developer console]
You have to admit, that's a pretty slick test setup they have there. Beyond the simulator they also dropped a bunch of test capabilities on us with the release of the Alexa CLI/SMAPI last summer - most notably a CLI utility that lets you make a faux-request to your skill via text and see what the output would be.
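
To give a feel for how that text-based simulation can slot into an automated workflow, here's a minimal sketch that shells out to the CLI from Python. Treat it as illustrative only: the exact command name, flags, and output format vary between CLI versions, and the skill ID is just a placeholder.

```python
# Rough sketch: wiring the ASK CLI's text-based simulate command into a script.
# Flags and output shape may differ by CLI version; SKILL_ID is a placeholder.
import json
import subprocess

SKILL_ID = "amzn1.ask.skill.xxxx"  # placeholder - substitute your own skill ID

def simulate(utterance, locale="en-US"):
    """Run a text utterance through the skill simulator and return its output."""
    result = subprocess.run(
        ["ask", "simulate", "--text", utterance, "--locale", locale,
         "--skill-id", SKILL_ID],
        capture_output=True, text=True, check=True,
    )
    # Depending on CLI version the output may be plain JSON or JSON wrapped in
    # status text, so fall back to the raw string if parsing fails.
    try:
        return json.loads(result.stdout)
    except ValueError:
        return result.stdout

if __name__ == "__main__":
    output = simulate("ask complibot for a compliment")
    # Diff this against a stored baseline to spot regressions in your responses.
    print(output if isinstance(output, str) else json.dumps(output, indent=2))
```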

All of these cool new test toys seem to be the direct result of Amazon forming a "Skill Quality" team within the Alexa org last year. It's something that was sorely lacking early on, and for those of us who are serious about the platform, that team's workstream is certainly paying dividends.

A Shifting Landscape

So, obviously Amazon has doubled down on solving a lot of those testing problems that existed in the early days of the platform. But in the same way that they've grown a skill quality team to build tools for us, they've also massively expanded their other development teams, resulting in an incredible cadence of new features and products that developers now need to be concerned about.

Take, for example, skill localization. When we wrote our original treatise, Alexa (officially) existed for a single locale - US English. Today, Alexa is available in German and Japanese, as well as Australian, Indian, Canadian, and British English (with Ireland reportedly up next). The result is that the testing burden for any given skill is multiplied several-fold, even if it's only released in English. As I'll explain later, there are reasons that this is even worse than it sounds.

Beyond straightforward concerns like localization, Amazon continues to release new features which muddy the testing waters a bit. As an example, where does testing happen when the paradigm for the thing you're testing has shifted a little bit so that it's no longer a simple request-response flow? This is something we're seeing today with new features like Skill Events, Progressive Responses, and most importantly Push Notifications (in fact, we touched on the testing aspect of this in a post a few months ago).

And taking this notion of new flows even further, what happens when you're working with entirely new modes of interaction? As noted above, the test console now includes a feature for testing display directives, but that specifically matches the form factor of the Echo Show. What about the brand-new Echo Spot? The circular screen of the Spot has serious implications for how we use display templates! And most recently, the release of Echo Buttons and the Alexa Gadgets API means we have a completely novel input-output mechanism that we have to account for in our skills.

So while the skill quality team is pumping out helpful updates, it's probably safe to assume they're not gonna be lacking work anytime soon.


The Solvable-But-Unsolved

So, we've talked about the (considerable) effort Amazon has made since we wrote the original Treatise on Testability. We've looked at a variety of the new challenges they're facing as they accelerate their development. The one thing we haven't yet touched on is what known issues from early in the platform have yet to be addressed. And really, when you boil it down, there's only one major component missing. But as far as we're concerned, that piece is the cornerstone. We need the ability to know how Alexa will take an audio input and resolve it via our skill's intent model. 

Consider a case where a developer has built a skill, and needs assurance that it will work across a variety of voices - crossing boundaries of age, gender, and regional accent. They could certainly spend the time and money to manually test the skill with a broad spectrum of users, and that would likely be a good upfront investment. But what happens the next time the developer wants to iterate on the skill? This idea of continual improvement and iteration of skills is something Amazon has been talking about of late. Repeating that sort of heavy-handed manual testing is certainly not sustainable as part of a continuous delivery pipeline.

Even happy path testing for a multi-lingual (or multi-locale English) skill becomes a huge pain, as you're essentially having to repeat the same flows over and over. As the Alexa tidal wave rapidly spreads into a global phenomenon, manual testing is going to become more and more of a losing proposition. As a tester by trade, and someone who has been more concerned with skill quality than 99% of other Alexa developers, even I have no interest in that sort of repetitive, hands-on time sink.

Both of the previous points have a lot to do with how you get a skill out the door, but there's also the problem of what happens once your skill is live. Sure, it's easy to make sure your Lambda or REST service doesn't change - you control that. But what about Alexa? You certainly don't "control" Alexa, and it's pretty easy to argue that nobody really does. There are hundreds (maybe thousands?) of engineers, all of them smarter than me, working daily on changing how the end-to-end flow works. And for the most part, they're making things better. But machine learning, natural language understanding, etc. are all inexact sciences. A change that makes 95% of skills resolve their intents as-good-or-better is still making 5% of them resolve worse. How do you know if your skill is in that 5%?

This may sound like an abstract problem, but it happens! When our first two skills (CompliBot and InsultiBot) were in their initial certification phase, we actually saw this happen live. We were lucky enough to be doing a series of manual test passes over a list of 40-some utterances that we wanted to work at launch, many of which included the AMAZON.US_FIRST_NAME slot. While we were working through the tests, all of a sudden we started getting incorrect responses due to the slot mapping to obscure (but phonetically similar) names. Instead of "give me an insult" we were getting mappings along the lines of "give Maeve an insult". As it happens, Maeve IS an English name... one that fell almost completely out of usage in the late 1800s. Right before our eyes, Amazon had expanded that slot's source data from the top 1,000 US names to the top 10,000 or so, thereby greatly increasing the chance of collisions and false-positive matches. Great for accommodating people with less common names, but it completely messed up the way our skill mapped one really important word.

(As an aside, the Alexa team at the time - late 2015 - was super weird about the whole thing. Despite the fact that we had multiple sets of clear baselines from having run through the test suite repeatedly and recorded our results, they refused to acknowledge that anything had changed. It's one thing to be secretive about your products, but it's another entirely to just lie for no gain. It was early days for the platform, and they were still figuring out how to interact with the community at the time, but it was strange nonetheless.)

The platform has continued to change and evolve today, and presumably changes to the intent mapping today go through A/B tests and produce tons of metrics for Amazon's internal folks before being applied broadly. But that doesn't mean that negatively impacting changes don't make it out the door. Over the first couple of months of this year, we've seen a flood of reports from some of the most well-respected and skilled developers describing a flurry of issues with how one-shot invocations work. And many of these folks are only finding out about it because they are beginning to get negative reviews.

And it's all because there's no way to know how Alexa is going to massage and remold your intent model when the underlying architecture and training data changes.

Just the price you pay, or can we fix it?

So, that situation is likely to make developers pull their hair out. Your system might break, you have no control over it, and no way to know if/when it's going to happen. What can anyone even do about it?

As far as we can see, there are three approaches:
  1. Amazon can just never stop iterating on Alexa. This is a dumb idea, I feel bad for writing it, and I certainly hope you aren't sitting there nodding your head in agreement right now. But technically speaking, it would solve the problem.
  2. Provide opt-in developer notifications when things change. We actually pushed for this early on - it made sense when Alexa was a speculative technology with a small development team pushing out modest updates occasionally. Today, the Alexa org is a behemoth that is becoming a new pillar of Amazon's business, and as a developer I'm certainly not interested in having my inbox fill up completely with "New Alexa change, retest your skills" emails.
  3. ------------> Give us a way to test through the front door. <---------------

Well, when you lay it out like that, it seems pretty obvious. But what do we mean by "test through the front door"?

Doors (comma) Front

[Image: our "Faux Test Harness" diagram]
It's a little sad that this (sweet) image from our DERPGroup days is still relevant two years later.
We've written about this ad nauseam, but this problem is solvable. It was solvable two years ago when we wrote the first article, and in fact we tried to solve it on our own last year after getting tired of waiting (unfortunately, the tools provided to us just aren't quite adequate for the hack we were throwing together). The basic premise is this:

Amazon builds a REST service. The input to that service is some audio. The output of that service is the JSON that would've been sent to our skill for that audio input.

That's it. That's all we need, and we can do continuous testing. We can do full regression of our voice model for tiny little changes. We can record test snippets for different types of voices and make sure our skills are inclusive in how they resolve. We can invest in porting our skills to other languages, knowing we don't need to go bug our one German-speaking friend every time we make an update.
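
To make the ask concrete, here's roughly what a front-door regression test against such a service could look like. To be clear, everything below is hypothetical - the endpoint, the audio clips, and the intent and slot names are all stand-ins - but the shape of the assertion is the whole point: feed in real recorded voices, and verify that the request JSON coming back is the one your skill expects.

```python
# Hypothetical sketch only - no such endpoint exists today. It shows the kind
# of regression test we could write if Amazon exposed an audio-in /
# request-JSON-out service. URL, clip names, and intent names are made up.
import requests

RESOLVE_URL = "https://alexa.example.com/v1/skills/{skill_id}/resolve"  # imaginary
SKILL_ID = "amzn1.ask.skill.xxxx"  # placeholder

def resolve_audio(audio_path):
    """POST an audio clip and get back the request JSON the skill would receive."""
    with open(audio_path, "rb") as audio:
        resp = requests.post(
            RESOLVE_URL.format(skill_id=SKILL_ID),
            data=audio,
            headers={"Content-Type": "audio/wav"},
        )
    resp.raise_for_status()
    return resp.json()

def test_insult_request_resolves_across_voices():
    # One recording per voice we care about: different ages, genders, accents.
    for clip in ("voices/kid.wav", "voices/scottish.wav", "voices/texan.wav"):
        request = resolve_audio(clip)
        intent = request["request"]["intent"]
        assert intent["name"] == "GetInsultIntent"
        # Guard against the Maeve problem: no phantom name in the Name slot.
        assert not intent["slots"]["Name"].get("value")
```

Drop tests like that into a continuous delivery pipeline, and every voice-model tweak (ours or Amazon's) gets caught before users start leaving one-star reviews.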

It doesn't need to do anything more than that, because as mentioned, Amazon and third parties have already covered testing the rest of the flow. This is the final missing cog, and once we plug it in, the gears of quality can start cranking.

What do you say, Amazon? Let's get this done.

Agree that this is a good idea? We have a UserVoice request up now. Disagree? We'd love to hear a solid counter-argument - hit us up in the comments or directly via email/Twitter.
