Summary: Can one use ML to verify ML-based systems? This post claims the answer is mostly “no”: You mainly have to use other system verification methodologies. However, some ML-based techniques may still be quite useful.

How does one verify ML-based systems? A previous post in this series claimed that the “right” way is CDV: Essentially, define what the system should and should not do, and write a verification environment (VE) to automate the corresponding scenario creation, checking and coverage collection (more on this below).

But (I hear you say) ML is a brand-new category with its own rules, and we hear ML is eating the SW world. Isn’t it a bit old-fashioned to assume that it should be verified like “old-style” SW-based systems?

After considering this question for a while, my short answer is “no”.

My longer answer is: Some ML techniques can actually be helpful in verifying ML-based system (and most of this post is devoted to that). But overall, verifying an ML-based system is just like verifying any complex system (except that ML-based systems are harder to verify because they are opaque).

The rest of this post will discuss:

What ML verification even means

The mainstream, CDV way I suggest for verifying ML

Problems specific to ML verification, and how we can, perhaps, use ML to help solve them

While I will be talking at length about using-ML-to-verify-ML, I urge you again, oh ML practitioners and researchers, to not be guided by the beauty of your weapons (as Leonard Cohen used to say). For serious, safety-critical systems, one needs serious, systematic verification.

The fact that (some of) the system is implemented via ML may make the implementation somewhat easier, and the verification somewhat harder. And ML-based systems are susceptible to some special classes of bugs (see Concrete problems in AI safety – to be covered in a subsequent post). But the principles of verification should probably stay unchanged.

Note that the usual caveats apply: ML is a fast-moving field, and I may have missed some ML developments related to verification. And I am not an ML expert (which may be an even bigger problem in the current post). Sometimes I feel even my imposter syndrome is somewhat fake 😉

So comments are very welcome.

What does “verifying ML-based systems” even mean?

I’ll talk here about ML-based systems for which verification is really important. Autonomous systems are a good example. By “verification” I mean “finding bugs before they annoy / kill somebody”. I am mainly talking about functional verification, but (as I described here) security verification, and especially fuzzing, is quite similar.

Note that most autonomous systems are not implemented solely via ML (though some are – e.g. the system created by this guy). Most systems currently use ML for just some of the pieces, often the “lower levels” such as perception and object identification.

I have covered this fact in a previous post, where I described how some people (e.g. Michael Fisher of Liverpool U) claim the upper levels should always be handled via some “logical”, readable, non-ML description, precisely because of the verification / debug issues.

For some interesting insights on autonomous vehicles (including verification) listen to this 25-minute interview of Sterling Anderson (director of autopilot programs, Tesla). He talks about the enormous difficulty of ironing out all the corner cases (especially since Tesla is driven in lots of different countries with different road conditions and driver cultures).

Then, at the 21:00 mark, he is asked about the importance of NNs, and he says they are very important, but can’t be the whole picture: Because there is such a long tail of corner cases, you can’t just “turn the crank” on the ML algorithm to fix them. Rather, those corner cases require “specific tuning of specific algorithms” (which I take to mean “SW modules”).

Note that this part (“fixing” the behavior of an ML system using hand-written SW) is itself a delicate, bug-prone process – I happen to know that from bitter experience.

OK. Assume then that we are verifying an ML subsystem in the context of a bigger system, say an Autonomous Vehicle (I have covered AV verification e.g. here and here). We should probably verify the ML subsystem both independently, and in the context of the full system (hopefully we’ll be able to reuse verification artifacts between the two environments).

In either case, the verification effort should probably mostly consist of:

Defining what the system should and should not do

Creating a verification environment capable of reaching all interesting scenarios / scenario parameters

Automatically checking all those scenarios while tracking coverage

Writing new tests to fill coverage holes

Debugging failures and fixing them

Repeating this whole process, with as much automation as possible, after bugs were fixed or new possible scenarios were discovered

So far, this is classical, non-ML CDV. However, as we shall see in the next few chapters, some of these bullets have specific problems / quirks in ML-land, and some of those may have ML-based solutions.

Here are some important things I plan to ignore in rest of this post:

There are, of course, other verification techniques (e.g. formal verification), but I assume the bulk of the verification will be done via something like CDV, as described above. I discussed those other techniques here.

ML systems which learn-on-the-job (i.e. do further training during deployment) are even harder to verify. Thus, AV companies (for instance) are mostly going to avoid that. This is an important topic, but does not fit in the current post.

Similarly, the whole topic of runtime verification / monitoring will be ignored.

Finally, this is a post about verification, so the whole (hugely important) topic of engineering ML-based systems so as to avoid bugs will also be ignored.

OK, let me now go through some of these verification tasks, more-or-less in reverse order:

Debugging failures and fixing them

Debugging turns out to be much harder for ML-based systems, mainly because these systems are opaque and have no (relevant) “source code” to use a debugger on.

In the same MIT technology review which featured the Tesla talk above, Peter Norvig (director of research, Google) talks about ML state-of-the-art. At the 7:30 mark he says that the really hard parts of ML verification are debugging, and fixing a bug while leaving everything else unchanged (summary is here, and a longer, very interesting video on ML-vs-SW is here).

I have talked before about Explainable AI, and why it might help debugging. Most of the techniques mentioned there are themselves ML-based.

BTW, If you search for “debugging machine learning” you will find that most of the advice is on how to debug your ML-training process once you find some misclassifications: Perhaps some of the labeling was wrong, or your ML model was inadequate, or you did not have enough training data etc..

But debugging an ML system starting (say) from a full-system AV failure can be even harder:

You need to understand what the bigger system was expecting from the ML subsystem

You need to be able to reconstruct the inputs and outputs of the ML subsystem (harder if those are continuous streams)

Often you need to think about whether this was even assumed to be a legal input to the ML subsystem

So this is less about “why did my ML system misclassify a cat as a dog” and more about “Oh, I see – it misclassified a person painted on a bus as a person on the road. Did we ever consider that? Under extreme fog? Shouldn’t the LIDAR handle that?”. This kind of debugging needs a good system integrator, and there are never enough of those (and they may not all understand the funky ML stuff).

Finally, as I mentioned briefly here, ML has the potential to help debugging of any system, e.g. in:

Failure clustering: A single bug can cause many failures in many test runs. ML can be used to help cluster together the failures such that each cluster (hopefully) corresponds to a single bug.

Automatic debugging: This slight misnomer refers (among other things) to techniques which help you understand the bug by suggesting what is unique in the runs leading to it. This is a bit like anomaly detection, and ML algorithms can be pretty helpful there.

Checking and coverage collection

Checking ML-based systems is hard for various reasons. One is that the rules are often probabilistic rather than binary – I have discussed this at length here. Another is the lack of modularity.

Coverage collection (and in fact even coverage definition) is hard because ML systems are opaque, and thus there is no way to collect meaningful internal-state coverage. You can usually cover the input space (e.g. the kinds of scenarios the VE was throwing at your DUT). But you probably can’t cross the scenario-kind with some internal state (and you often want to that).

ML techniques leading to Explainable AI may help that – see here. I also mentioned there the idea of creating a second ML system just for exporting human-readable information for verification.

Note that if we could create a simplified model (e.g. state-machine-based) of an ML system, this could help generation, checking and coverage, even if that model was incomplete. There exist ML-based systems (e.g. LearnLib) for learning a state machine by just observing system inputs and outputs.

Scenario creation and execution

We are talking here about:

Creating randomized, interesting scenarios (with all their many parameters)

Executing them (perhaps mixing several scenarios together)

Interfacing them to the DUT’s inputs and outputs

Together, this is much of what verification people spend their time on. This is always hard to do well, and is even more challenging in ML verification. I’ll discuss below some possible ML-based solutions:

Converting abstract scenarios into actual inputs is often hard. For instance, to verify the vision ML system in an AV one may need realistic, coordinated LIDAR and video input streams – that’s very hard to do.

One idea for making scenario generation easier is to use Generative Adversarial Networks (GANs). A GAN works as follows: Suppose you already have an ML classifier C (say a classifier from an image to the label “cat”). You now create an ML generator G, whose job is to produce images which can fool C into classifying them as a cat. Over time, C improves (i.e. gets better at distinguishing cats) and G improves (i.e. gets better at faking cats).

People have done amazing things with GANs. For instance, this paper combined a GAN with a Recurrent Neural Network (RNN) to create a system which generates images (e.g. of flowers) based on a text description (e.g. “a flower with long pink petals and raised orange stamen”).

My hope is that such turbo-charged GANs can be used for generating random-but-realistic scenarios / objects, perhaps based on text descriptions. For instance, if your AV has an ML-based image-to-list-of-objects recognizer, perhaps we can create a GAN which (given a list of objects, or some text description) will create a corresponding, random image. This can be quite valuable, especially if it can also generate input streams.

BTW, I have a vague sense that the value of scenario-generation-via-GAN will be limited if we use the DUT itself as part of the GAN (because the generation will be modulated by the very ML we want to test), but there may be ways around that.

There are also GAN-like ML systems which can combine two artifacts to create a coherent whole (e.g. combining the style of one image with the high-level objects of another image). You may be familiar with Google’s Deep Dreams: It lets you impose an arbitrary “style” (e.g. “colorful paint”) on an arbitrary uploaded image. Perhaps similar ideas can be used for combining multiple scenarios, or for superimposing a pattern on a data stream.

Finally, you may have seen (e.g. in the excellent The unreasonable effectiveness of Recurrent Neural Networks) how people use RNNs to create random text which looks like something by Shakespeare, or like a Wikipedia article, or like valid XML and so on. I hope we can use similar techniques for creating “reasonable” input streams to test ML-based systems.

Note that as you train these systems (e.g. by going over recorded inputs), they will initially just learn the low-level lexical structure, and then learn deeper and deeper structure. These are all good for verification: We want to bombard our DUT with both lexically-correct-but-meaningless-inputs and more-meaningful-inputs. Fuzzer writers especially should love this.

Automating coverage maximization

This post discusses using ML for coverage maximization (i.e. the automatic filling of coverage holes). It should ideally work for any system, and thus also for ML-based systems.

Another trick which could also, perhaps, help coverage maximization is something called “adversarial examples”. Let me explain:

In addition to the GANs described above, ML people (who seem to really like the word “adversarial”) have discovered simple techniques which can be used to fool ML systems. For instance, one can take a picture of a stop sign, and (by changing very few pixels) cause it to be reliably classified as a yield sign, as described by this article. Essentially you do that (I think) by looking at the derivative of the surface separating the two classifications.

This article says that the same fooling examples work for many kinds of ML systems (NNs, SVNs, decision trees, whatever). Finally, this article tries to explain why this is: fooling examples should often transfer well to all linear models. It also claims that the correctly-classified inputs form only “thin manifolds” in the input space. These are surrounded by fooling examples, but most of the input space consists of “rubbish examples” – inputs which are far from anything considered during the training.

OK, so all this sounds important for security (assuming the bad guys want to purposely fool an ML-based machine, for whatever nefarious purpose). But how important is it for us verification people (assuming we are not verifying security)?

Clearly, the VE should check that the DUT does not misclassify (too often). And these adversarial examples are clearly misclassifications. But will these kinds of misclassified inputs appear in real life in any meaningful number, or are they rare, unrealistic “just-so” cases?

Not sure about that: I suspect they don’t spontaneously appear too often (and that last article above seems to agree). But perhaps we can learn from them about changes which would appear often. I think this is an interesting research question: producing “reasonable adversarial examples”, i.e. inputs which are mislabeled, but which are optimized not for minimal, hopefully-invisible difference from a correctly-labeled input, but rather for being “reasonable” or “possible” in some way.

BTW, producing misclassified inputs is nice, but often what we want to produce is inputs for which there is no label in the original training (e.g. an approaching Tsunami). Another interesting research question is whether one can produce such inputs automatically – I suspect the answer is “no”.

To summarize:

ML-based systems are going to be hard to verify

Most of the heavy lifting should be done via non-ML methodologies

But ML-based techniques can help in various ways

I’d love to learn about other ML-for-ML-verification ideas.

Notes

I’d like to thank Amiram Yehudai, Kerstin Eder and Thomas (Blake) French for commenting on previous versions of this post.

[Added 14-Sep-2016: Some Reddit comments are here]