Ignore for the moment that this is a nigh-impossible task to do in a repeatable way, since they provide no test-automation hooks - the real kicker is that they are constantly shifting the goalposts in this regard. Just as we are capable of modifying our services after certification, the Alexa team is capable of modifying their voice model in ways that break production services. Now, you may be saying to yourself, "Sure, Amazon could do that, but they wouldn't, because they are a rational business that wants to maintain a high level of quality in its products" - and you would be absolutely wrong.
As part of our initial rejection, the certification team gave us some suggestions they thought might improve the hit rate of our voice interface (it's worth noting that this sort of exchange is exactly the kind of helpful thing the certification team could be spending its time on, instead of policing an unnecessary spec). We went to work trying out these changes, undergoing a rigorous manual test pass after each one and comparing against various baselines. After one particularly late evening of testing, we came back the next morning to find that things were no longer working, and we couldn't understand why. Nothing on our end had changed at all (as an aside, this is exactly why baselines are valuable), but all of a sudden our previously successful invocations had begun failing. We tried getting input from Amazon employees on the forums, to no avail.
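The baseline comparison we leaned on can be sketched roughly like this. To be clear, this is a minimal illustration: the data shapes, the `diff_against_baseline` helper, and the intent strings are all hypothetical, not an Amazon API - the point is just that recording results from a known-good test pass lets you prove the regression came from their side, not yours.

```python
# Hypothetical sketch of baseline-driven regression detection: record the
# recognition result for each test utterance during a known-good pass, then
# diff later passes against that snapshot. Data shapes are invented.

def diff_against_baseline(baseline, current):
    """Return utterances whose recognition result changed since the baseline.

    baseline/current: dict mapping spoken utterance -> recognized result,
    e.g. {"read it to me": "ReadIntent(recipient=me)"}
    """
    regressions = {}
    for utterance, expected in baseline.items():
        actual = current.get(utterance)
        if actual != expected:
            regressions[utterance] = (expected, actual)
    return regressions

# Snapshot taken before the mystery event...
baseline = {
    "read it to me": "ReadIntent(recipient=me)",
    "stop": "StopIntent()",
}
# ...and the results from the morning after. Our code hadn't changed.
current = {
    "read it to me": "ReadIntent(recipient=Mayme)",
    "stop": "StopIntent()",
}

for utterance, (was, now) in diff_against_baseline(baseline, current).items():
    print(f"{utterance!r}: was {was}, now {now}")
```

Because the baseline is a frozen artifact from before the change, any diff it surfaces is attributable to the platform rather than to your own code.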
The running theory here is that Amazon decided to change the enumeration they use to describe US first names. Originally it was built as a list of the top N (maybe 1000?) names in the US. Overnight, they seem to have changed it to include every single name registered in the US Census. We know this because we were able to look at our newfound failures and see that they were tied to incoming phrases that matched words we had never seen before. Specifically, the pronoun "me" was suddenly being replaced by obscure names.
There had always been the occasional false-positive hit for the name "May" instead of the word "me", but all of a sudden we were seeing a massive spike in the failure rate, with words like "mi" and "mei" also showing up where they hadn't before. The real smoking gun, though, was when we started matching "Mayme". If this name looks obscure to you, that's because it is: it's a female name that peaked in the 1880s and has been almost non-existent among new births for the last 100 years. It exists in no form other than as a proper noun, and we had seen exactly zero instances of it before this mystery event. After the event, it became a fixture of our newly flaky interface.
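The kind of defensive patch this failure mode forces on you can be sketched as follows. This is purely a hypothetical workaround of our own devising, not anything Amazon recommends or exposes: fold the known "me" homophones back into the pronoun before acting on a name slot.

```python
# Hypothetical mitigation: collapse name-slot values that are plausibly
# misrecognitions of the pronoun "me" back into "me" before interpreting
# them. The homophone set comes from the exact failures described above.

ME_HOMOPHONES = {"may", "mi", "mei", "mayme"}

def normalize_name_slot(value):
    """Treat known 'me' homophones as the pronoun rather than a first name."""
    if value.lower() in ME_HOMOPHONES:
        return "me"
    return value

print(normalize_name_slot("Mayme"))  # -> me
print(normalize_name_slot("Alice"))  # -> Alice
```

The ugliness of this is the point: a list like `ME_HOMOPHONES` can only ever be reactive, built from failures you've already eaten in production, and every silent voice-model change risks adding new entries.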
This is simply not a sustainable model. To bring things full circle, I think the takeaway is that this is not a two-way street between the Alexa team and the development community. They've designed a paradigm wherein we, as developers, are not able to make decisions about the best way to implement our own code, while at the same time they've taken a weirdly silent, proprietary approach that gives us no way to account for the changes they force on us.
It feels very much as if they don't actually want us there - as if they feel they are doing us a service by opening up for development, but that it's a service they absolutely disdain.
It's not hard to see the poison seeping into their new Alexa ecosystem day by day, and that's a real shame.